PATENT



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



In re Application of: Schulein et al.



Confirmation No: 1722



Serial No.: 09/576,778



Group Art Unit: 1652 



Filed: May 23, 2000 



Examiner: M. Rao 



For: Family 9 Endo-Beta-1,4-Glucanases



APPEAL BRIEF 



Mail Stop Appeal Brief - Patents 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



Sir: 



Applicants hereby appeal from the final rejection of claims 62, 63, 65-67, and 72-81. 

I. REAL PARTY IN INTEREST 

The name of the real party in interest in this appeal is Novozymes A/S. 

II. RELATED APPEALS AND INTERFERENCES 

There are no appeals or interferences relating to the instant application. 

III. STATUS OF THE CLAIMS 

Claims 62, 63, and 65-81 remain pending in the application. Claims 1-61 and 64 have 
been canceled. Claims 68-71 are objected to. A copy of the pending claims is attached hereto 
as Appendix 1. Only claims 62, 63, 65-67, and 72-81 are included in this appeal.

IV. STATUS OF AMENDMENTS 

The amendment filed under 37 C.F.R. § 1.116 on June 22, 2004 was considered, but 
has been stated as not overcoming the final rejection. 

V. SUMMARY OF THE INVENTION 

The invention relates to isolated enzymes exhibiting beta-1,4-endoglucanase activity (EC
3.2.1.4), which (a) have a temperature optimum of 65°C measured at a pH of 7.5 and (b)(i) have an
amino acid sequence that is at least 90% identical to amino acids 1-456 or 1-617 of SEQ ID NO:
2, wherein identity is determined by GAP provided in the GCG program package using a GAP
creation penalty of 3.0 and GAP extension penalty of 0.1, or (ii) are encoded by a DNA sequence
that hybridizes to nucleotides 76-1455 of SEQ ID NO: 1 under high stringency conditions,
wherein the high stringency conditions are defined as hybridization in 5xSSC at 45°C and
washing in 2xSSC at 70°C.
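
For orientation only, the identity thresholds recited in the claims can be translated into an approximate count of positions that may differ from SEQ ID NO: 2. The following is a minimal sketch, not part of the record. It assumes that identity is simply the number of identical positions divided by the length of the recited region (456 or 617 residues) in a gap-free, full-length comparison; the GAP program's own calculation over an alignment may differ. The function name is hypothetical.

```python
def max_differing_positions(length: int, min_identity_pct: int) -> int:
    """Largest number of non-identical positions that still meets the threshold.

    Assumption (not from the record): identity = identical positions / length,
    over a gap-free, full-length comparison. GAP's own calculation may differ.
    """
    # Integer ceiling division avoids floating-point rounding at the boundary.
    min_identical = -(-min_identity_pct * length // 100)
    return length - min_identical


if __name__ == "__main__":
    for length in (456, 617):       # amino acids 1-456 and 1-617 of SEQ ID NO: 2
        for pct in (90, 95, 98):    # thresholds recited in claims 62, 66, and 67
            print(length, pct, max_differing_positions(length, pct))
```

Under these simplifying assumptions, a 456-residue sequence could differ at up to 45 positions and still meet the 90% threshold.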

VI. ISSUES 

The outstanding issue is whether the specification enables the inventions of claims 62, 
63, and 65-81.

VII. GROUPING OF CLAIMS 

For purposes of determining patentability, the following claims are grouped together:

Group I: Claims 62, 63, 65, 72, and 78-81;
Group II: Claim 66;
Group III: Claim 67;
Group IV: Claim 75;
Group V: Claim 73;
Group VI: Claim 74;
Group VII: Claim 76; and
Group VIII: Claim 77.



VIII. ARGUMENTS 

A. Claims 62, 63, And 65-81 Are Enabled By The Specification 

Claims 62, 63, and 65-81 stand rejected under 35 U.S.C. 112, first paragraph, because the
specification does not enable any beta-1,4-endoglucanase that has at least 90%, 95% or 98%
sequence identity with amino acids 1-456 or 1-617 of SEQ ID NO: 2. Applicants submit that this
rejection is improper and should be reversed.

It is well settled that an assertion by the Patent Office that the enabling disclosure is not 
commensurate in scope with the protection sought must be supported by evidence or reasoning 
substantiating the doubts so expressed. In re Dinh-Nguyen, 181 U.S.P.Q. 46 (C.C.P.A. 1974). 
See also U.S. v. Telectronics, 8 U.S.P.Q.2d 1217 (Fed. Cir. 1988); In re Bowen, 181 U.S.P.Q. 
48 (C.C.P.A. 1974); Ex parte Hitzeman, 9 U.S.P.Q.2d 1821 (BPAI 1988). 

Moreover, in the absence of any evidence or apparent reason why compounds do not
possess the disclosed utility, the allegation of utility in the specification must be accepted as
correct. In re Kamal, 158 U.S.P.Q. 320 (C.C.P.A. 1968). See also In re Stark, 172 U.S.P.Q.
402, 406 n. 4 (C.C.P.A. 1972) (the burden is upon the Patent Office to set forth reasonable
grounds in support of its contention that a claim reads on inoperable subject matter).

In the present case, the Office has provided only arguments that the specification does not 
enable the claimed invention. It has not provided any evidence to support its arguments. For this 
reason alone, the rejection under 35 U.S.C. 112 should be reversed. 

Moreover, the specification enables the claimed invention. 

The Office argues at page 4 of the Office Action mailed January 16, 2004 that: 

While recombinant and mutagenesis techniques are known, it is not routine 
in the art to screen for multiple substitutions or multiple modifications, as 
encompassed by the instant claims, and the positions within a protein's sequence 
where amino acid modifications can be made with a reasonable expectation of 
success in obtaining the desired activity/utility are limited in any protein and the 
result of such modifications is unpredictable. In addition, one skilled in the art would 
expect any tolerance to modification for a given DNA to diminish with each further 
and additional modification, e.g. multiple substitutions. 

The specification contains an extensive disclosure of techniques which are well known in
the art and routine for persons of ordinary skill in the art for identifying other
beta-1,4-endoglucanases of the present invention and DNA sequences encoding same. For example, in
the paragraph bridging pages 9-10 of the specification, Applicants describe methods for
preparing and probing DNA libraries; cloning DNA sequences using polymerase chain reaction;
and detecting an endoglucanase using an antibody raised against an endoglucanase from
Bacillus licheniformis, ATCC 14580, i.e., the beta-1,4-endoglucanase of SEQ ID NO: 2.

Moreover, the specification discloses at pages 13-14 that the beta-1,4-endoglucanase of
SEQ ID NO: 2 can be mutated by methods known in the art. These methods include site-directed
mutagenesis, random mutagenesis, and shuffling. The specification further discloses that the
mutations are preferably of a minor nature, i.e., conservative substitutions. A table defining
conservative amino acid substitutions is provided at page 12. At page 14, the specification
references a number of patent applications which describe shuffling methods. The specification
also discloses at pages 25-26 an assay for measuring beta-1,4-endoglucanase activity.

Using these techniques, persons of ordinary skill in the art are able to routinely produce
thousands of mutants of the beta-1,4-endoglucanase of SEQ ID NO: 2 in a short period of time.
Moreover, these techniques have been used for decades to produce modified polypeptides
having a desired function/utility. Following Applicants' disclosure, one of ordinary skill in the art
would be able to isolate and identify the claimed enzymes. While some experimentation might
be necessary to isolate and identify other beta-1,4-endoglucanases, such
experimentation would require carrying out a simple process without special equipment or
unusual reaction conditions. This experimentation, if required, would not be undue and certainly
would not require ingenuity beyond that of one of ordinary skill in the art. Certainly, there is no
evidence of record to the contrary.

Moreover, attached as Appendix 2 is a BLAST search of a public sequence database 
(Protein Information Resource, PIR-NREF) using a sequence identity threshold of 90%. The 
enzymes which were searched were different fungal enzymes (amylase, acid protease, 
glucoamylase, exocellobiohydrolase, endoglucanase, phytase, lipase, and phospholipase B). The 
search gave only enzymes which possessed the same biological/biochemical activity. These 
results show clearly and convincingly that one of ordinary skill in the art would expect that proteins 
having 90% sequence identity would have the desired function/utility. 
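
The kind of tally that underlies this observation can be sketched as follows. This is an illustrative sketch only, not the procedure actually used to prepare Appendix 2. The rows are a few hits as read from section E of Appendix 2 (endoglucanase query); the function name and the keyword-matching rule are assumptions made for illustration.

```python
from typing import List, Tuple

Hit = Tuple[str, int]  # (annotation, percent identity)

# A few rows as read from Appendix 2, section E (endoglucanase query).
HITS: List[Hit] = [
    ("Endoglucanase EG-1 precursor (EC 3.2.1.4)", 100),
    ("Endoglucanase I [Trichoderma viride]", 99),
    ("Endoglucanase EG-1 precursor (EC 3.2.1.4)", 94),
    ("Endoglucanase I [Trichoderma viride]", 93),
    ("unnamed protein product [Talaromyces emersonii]", 55),
    ("Endo-1,4-beta-glucanase (EC 3.2.1.4)", 51),
]


def annotation_agreement(hits: List[Hit], min_identity: int, keyword: str) -> Tuple[int, int]:
    """Return (hits whose annotation contains keyword, hits at or above min_identity)."""
    selected = [h for h in hits if h[1] >= min_identity]
    # Hypothetical rule: a case-insensitive substring match stands in for
    # "annotated with the same activity".
    matching = [h for h in selected if keyword.lower() in h[0].lower()]
    return len(matching), len(selected)


if __name__ == "__main__":
    print(annotation_agreement(HITS, 90, "glucanase"))
    print(annotation_agreement(HITS, 50, "glucanase"))
```

In this subset, every hit at or above 90% identity carries an endoglucanase annotation; the only non-matching row at the 50% cutoff is an unnamed product.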

These results are further supported by Wilson and colleagues (C.A. Wilson et al., 2000, J. 
Mol. Biol. 297: 233-249, a copy of which is attached hereto) who established a clear relationship 
between sequence similarity and functional similarity. Wilson et al. found that functional identity is 
conserved down to approximately 40% amino acid sequence identity, and that among proteins that 
share 50-100% sequence identity, function is conserved in almost all. It is noteworthy that Wilson 
et al. also found that the percent identity is more effective at quantifying functional conservation 
than probabilistic scores (P-values, E-scores). Thus, it is not "unpredictable" to make base 
changes within a nucleic acid's sequence and maintain the desired activity/utility, as suggested by 
the Examiner. 

The Office also argues at page 5 of the Office Action mailed January 16, 2004 that: 

The specification does not support the broad scope of the claims which
encompass all modifications and fragments of any beta-1,4-endoglucanase having
[90%] through 98% identity to SEQ ID NO: 2 because the specification does not
establish: (A) regions of the protein structure which may be modified without
effecting endoglucanase activity; (B) the general tolerance of
beta-1,4-endoglucanase to modification and extent of such tolerance; (C) a rational and
predictable scheme for modifying any beta-1,4-endoglucanase amino acid residues
with an expectation of obtaining the desired biological function; and (D) the
specification provides insufficient guidance as to which of the essentially infinite
possible choices is likely to be successful.

The Office's arguments are misplaced. In the paragraph bridging pages 13 and 14, the
specification discloses that one of ordinary skill in the art can readily identify essential amino acids
and the active site in the amino acid sequence of SEQ ID NO: 2 by methods known in the art.
Specifically, the specification refers to the methods described in Cunningham and Wells, Science,
244:1081-1085 (1989) for identifying essential amino acids and the methods described in Vos et
al., Science, 255:306-312 (1992), Smith et al., J. Mol. Biol., 224:899-904 (1992), and Wlodaver et
al., FEBS Letters, 309:59-64 (1992) for identifying the active site. These techniques have been
used routinely and successfully for many years for identifying essential amino acids and the active
site. While some experimentation might be necessary to identify the
essential amino acids and the active site of the beta-1,4-endoglucanase of SEQ ID NO: 2, such
experimentation would require carrying out a simple process without special equipment or
unusual reaction conditions. Again, this experimentation, if required, would not be undue and
certainly would not require ingenuity beyond that of one of ordinary skill in the art. Certainly, there
is no evidence of record to the contrary.

For the foregoing reasons, Applicants submit that the rejection under 35 U.S.C. 112 is 
improper. Accordingly, Applicants respectfully request that the rejection be reversed. 

IX. CONCLUSION 

For the foregoing reasons, Applicants submit that the rejections are improper. 
Accordingly, the final rejection of the claims should be reversed. 



Respectfully submitted, 



Date: November 12, 2004 




Elias J. Lambiris, Reg. No. 33,728
Novozymes North America, Inc. 
500 Fifth Avenue, Suite 1600 
New York, NY 10110 
(212) 840-0097 

Appendix 1
Copy of Pending Claims Of Which 
Claims 62, 63, 65-67, and 72-81 Are Involved in the Appeal 

62. An isolated enzyme exhibiting beta-1,4-endoglucanase activity (EC 3.2.1.4), which (a) has 
a temperature optimum of 65°C measured at a pH of 7.5 and (b)(i) has an amino acid sequence 
that is at least 90% identical to amino acids 1-456 or 1-617 of SEQ ID NO: 2 wherein identity is 
determined by GAP provided in the GCG program package using a GAP creation penalty of 3.0 
and GAP extension penalty of 0.1 or (ii) is encoded by a DNA sequence that hybridizes to 
nucleotides 76-1455 of SEQ ID NO: 1 under high stringency conditions, wherein the high 
stringency conditions are defined as hybridization in 5xSSC at 45°C and washing in 2xSSC at 
70°C. 

63. The enzyme of claim 62, which belongs to family 9 of glycosyl hydrolases. 

65. The enzyme of claim 63, which has an amino acid sequence that is at least 90% identical 
to amino acids 1-456 or 1-617 of SEQ ID NO: 2. 

66. The enzyme of claim 65, which has an amino acid sequence that is at least 95% identical 
to amino acids 1-456 or 1-617 of SEQ ID NO: 2. 

67. The enzyme of claim 66, which has an amino acid sequence that is at least 98% identical 
to amino acids 1-456 or 1-617 of SEQ ID NO: 2. 

68. The enzyme of claim 62, which comprises an amino acid sequence of amino acids 1-456 of 
SEQ ID NO: 2. 

69. The enzyme of claim 62, which comprises an amino acid sequence of amino acids 1-617 of 
SEQ ID NO: 2. 

70. The enzyme of claim 62, which consists of an amino acid sequence of amino acids 1-456 
of SEQ ID NO: 2. 

71. The enzyme of claim 62, which consists of an amino acid sequence of amino acids 1-617
of SEQ ID NO: 2. 

72. The enzyme of claim 62, which is encoded by a DNA sequence that hybridizes to 
nucleotides 76-1455 of SEQ ID NO: 1 under high stringency conditions, wherein the high 
stringency conditions are defined as hybridization in 5xSSC at 45°C and washing in 2xSSC at 
70°C.

73. The enzyme of claim 72, which is encoded by a DNA sequence that hybridizes to 
nucleotides 76-1455 of SEQ ID NO: 1 under high stringency conditions, wherein the high 
stringency conditions are defined as hybridization in 5xSSC at 45°C and washing in 2xSSC at 
75°C. 

74. The enzyme of claim 62, which is a Bacillus licheniformis enzyme. 

75. The enzyme of claim 74, which is a Bacillus licheniformis, ATCC 14580 enzyme. 

76. The enzyme of claim 62, which is active at a pH in the range of 4-11.

77. The enzyme of claim 76, which is active at a pH in the range of 5.5-10.5. 

78. An enzyme composition comprising the enzyme of claim 62. 

79. The composition of claim 78, which further comprises one or more enzymes selected from 
the group consisting of alpha-amylases, cellobiohydrolases, cellulases (endoglucanases), 
cutinases, beta-glucanases, glucoamylases, hemicellulases, laccases, ligninases, lipases, 
oxidases, pectate lyases, pectin acetyl esterases, pectinases, pectin lyases, pectin 
methylesterases, peroxidases, phenoloxidases, polygalacturonases, proteases, pullulanases, 
reductases, rhamnogalacturonases, xylanases, xyloglucanases, other mannanases, 
transglutaminases; and mixtures thereof. 

80. A method for degradation of cellulose-containing biomass, comprising treating the biomass 
with an effective amount of the enzyme of claim 62. 

81. An enzyme exhibiting beta-1,4-endoglucanase activity (EC 3.2.1.4) which has an amino
acid sequence comprising amino acids 1-456 or 1-617 of SEQ ID NO: 2. 

Appendix 2

Demonstration that proteins which share 50-100% amino acid sequence identity are annotated
to have the same biochemical/biological function.



A. Query sequence = Aspergillus niger glucoamylase (glucan 1,4-alpha-glucosidase (EC 
3.2.1.3)). The following BLASTP hits with at least 50% sequence identity are all annotated as 
glucoamylase enzymes (except those noted as hypothetical or unnamed products): 



Abstract | E-value | % Identity
pirnref|NF00073574| Glucoamylase I precursor (EC 3.2.1.3) (Gluca... | 0.0 | 100
pirnref|NF00889958| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr... | 0.0 | 100
pirnref|NF00626751| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr... | 0.0 | 98
pirnref|NF00626853| Glucoamylase precursor (EC 3.2.1.3) (Glucan ... | 0.0 | 98
pirnref|NF00889947| Glucoamylase precursor (EC 3.2.1.3) [Aspergi... | 0.0 | 97
pirnref|NF01651009| glucoamylase [Aspergillus awamori] | 0.0 | 96
pirnref|NF00626460| Glucoamylase precursor (EC 3.2.1.3) (Glucan ... | 0.0 | 94
pirnref|NF00889975| Glucoamylase precursor (EC 3.2.1.3) (Glucan ... | 0.0 | 94
pirnref|NF01328097| Glucoamylase [Aspergillus niger] | 0.0 | 93
pirnref|NF00626366| preproglucoamylase G2 [Aspergillus niger] | 0.0 | 93
pirnref|NF00889945| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) G2... | 0.0 | 93
pirnref|NF00889968| Glucoamylase-471 [Aspergillus awamori] | 0.0 | 98
pirnref|NF00889964| Glucoamylase-471 [Aspergillus awamori] | 0.0 | 98
pirnref|NF00889951| Glucoamylase-471 (1,4-Alpha-D-Glucan Glucohy... | 0.0 | 98
pirnref|NF00626575| Glucoamylase precursor (EC 3.2.1.3) (Glucan ... | 0.0 | 66
pirnref|NF00494189| Glucoamylase precursor (EC 3.2.1.3) [Talarom... | 0.0 | 61
pirnref|NF00649388| glucan 1,4-alpha-glucosidase (EC 3.2.1.3) pr... | 0.0 | 55
pirnref|NF00647663| Glucoamylase precursor (EC 3.2.1.3) (Glucan ... | 0.0 | 55
pirnref|NF00648280| glucan 1,4-alpha-glucosidase [Neurospora cra... | 0.0 | 55
pirnref|NF01576653| hypothetical protein MG01096.4 [Magnaporthe ... | 0.0 | 53
pirnref|NF01709909| hypothetical protein FG06278.1 [Gibberella z... | e-173 | 50


B. Query sequence = Aspergillus niger aspergillopepsin (acid proteinase/aspartyl 
protease/preproproctase). The following BLASTP hits with at least 50% sequence identity are 
all annotated as acid protease/aspartyl protease enzymes (except those noted as hypothetical or 
unnamed products): 


Abstract | E-value | % Identity
pirnref|NF00626537| Aspergillopepsin A precursor (EC 3.4.23.18) ... | 0.0 | 100
pirnref|NF00626722| aspergillopepsin I (EC 3.4.23.18) precursor ... | 0.0 | 99
pirnref|NF00918479| Aspergillopepsin A precursor (EC 3.4.23.18) ... | 0.0 | 99
pirnref|NF00626425| Preproproctase B precursor [Aspergillus niger] | 0.0 | 96
pirnref|NF00889972| Aspergillopepsin A precursor (EC 3.4.23.18) ... | 0.0 | 96
pirnref|NF00626729| Aspergillopepsin [Aspergillus phoenicis] | 0.0 | 99
pirnref|NF00627288| Aspergillopepsin I (EC 3.4.23.18) [Aspergill... | e-163 | 71
pirnref|NF00626684| Aspergillopepsin A precursor (EC 3.4.23.18) ... | e-155 | 67
pirnref|NF00626584| aspergillopepsin O [Aspergillus oryzae] | e-155 | 67
pirnref|NF00626993| Propenicillopepsin-JT2 precursor [Penicilliu... | e-152 | 67
pirnref|NF00176292| Putative aspartic protease [Emericella nidul... | e-145 | 65
pirnref|NF00889953| aspergillopepsin I (EC 3.4.23.18) [Aspergill... | e-144 | 81
pirnref|NF00627293| Aspergillopepsin F precursor (EC 3.4.23.18) ... | e-142 | 66
pirnref|NF01463777| acid proteinase [Monascus purpureus] | e-141 | 63
pirnref|NF00627188| Aspartic proteinase [Penicillium roquefortii] | e-138 | 63
pirnref|NF00626580| Aspartic proteinase 11-1 [Aspergillus oryzae] | e-137 | 65
pirnref|NF01229506| Aspartic Proteinase [Aspergillus oryzae] | e-129 | 70
pirnref|NF00627002| Prepropenicillopepsin-JT3 precursor [Penicil... | e-129 | 58
pirnref|NF00626995| Penicillopepsin (EC 3.4.23.20) (Peptidase A)... | e-127 | 68
pirnref|NF00626992| penicillopepsin (EC 3.4.23.20) [Penicillium ... | e-127 | 68
pirnref|NF00747468| Aspartic proteinase precursor (EC 3.4.23.-) ... | e-104 | 49
pirnref|NF00646517| Endothiapepsin precursor (EC 3.4.23.22) (Asp... | e-103 | 50
pirnref|NF01576663| hypothetical protein MG02898.4 [Magnaporthe ... | e-103 | 49
pirnref|NF00646493| Endothiapepsin [Cryphonectria parasitica] | e-102 | 55


C. Query sequence = Aspergillus oryzae α-amylase (AMY1, Taka-amylase, amyA). The
following BLASTP hits with at least 50% sequence identity are all annotated as α-amylase
enzymes (except those noted as hypothetical or unnamed products):


Abstract | E-value | % Identity
pirnref|NF00626669| Alpha-amylase A precursor (EC 3.2.1.1) (Taka... | 0.0 | 100
pirnref|NF01651008| alpha-amylase [Aspergillus awamori] | 0.0 | 99
pirnref|NF00626583| unnamed protein product [Aspergillus oryzae] | 0.0 | 100
pirnref|NF01544944| alpha-amylase [Aspergillus kawachii] | 0.0 | 100
pirnref|NF00626750| alpha-amylase (EC 3.2.1.1) precursor [Asperg... | 0.0 | 99
pirnref|NF00626854| Alpha-amylase precursor (EC 3.2.1.1) (1,4-al... | 0.0 | 99
pirnref|NF00626612| Taka-amylase A (EC 3.2.1.1) (Alpha-amylase) ... | 0.0 | 99
pirnref|NF00625791| Taka-amylase A (EC 3.2.1.1) (Alpha-amylase) ... | 0.0 | 99
pirnref|NF00626368| alpha-amylase-precursor [Aspergillus niger] | 0.0 | 99
pirnref|NF00889948| Alpha-amylase B precursor (EC 3.2.1.1) (1,4-... | 0.0 | 99
pirnref|NF00626646| alpha-amylase (EC 3.2.1.1) precursor [Asperg... | 0.0 | 99
pirnref|NF00626648| Taka-amylase A (Taa-G1) precursor [Aspergill... | 0.0 | 99
pirnref|NF00626351| alpha-amylase-precursor [Aspergillus niger] | 0.0 | 99
pirnref|NF00889969| Alpha-amylase A precursor (EC 3.2.1.1) (1,4-... | 0.0 | 99
pirnref|NF00626590| alpha-amylase (EC 3.2.1.1) precursor [Asperg... | 0.0 | 99
pirnref|NF00626638| Taka Amylase [Aspergillus oryzae] | 0.0 | 100
pirnref|NF00626642| alpha-amylase (EC 3.2.1.1) [Aspergillus oryzae] | 0.0 | 97
pirnref|NF00176034| Alpha-amylase AmyA [Emericella nidulans] | 0.0 | 69
pirnref|NF00073571| Acid-stable alpha-amylase [Aspergillus kawac... | 0.0 | 67
pirnref|NF01651007| alpha-amylase [Aspergillus awamori] | 0.0 | 68
pirnref|NF00626487| alpha-amylase (EC 3.2.1.1) [Aspergillus niger] | 0.0 | 67
[entry illegible] | 0.0 | 66
pirnref|NF00176203| Alpha-amylase [Emericella nidulans] | 0.0 | 63
pirnref|NF01752634| alpha-amylase precursor [Lipomyces starkeyi] | 0.0 | 60
pirnref|NF00756572| unnamed protein product [Thermomyces lanugin... | 0.0 | 60
pirnref|[illegible]| [illegible] [Lipomyces kononenkoae subsp. spencermartinsi... | e-180 | 57
pirnref|NF00186159| Alpha-amylase 1 precursor (EC 3.2.1.1) (1,4-... | e-180 | 56
pirnref|NF00490302| Alpha-amylase 2 precursor (EC 3.2.1.1) (1,4-... | e-167 | 56
pirnref|NF00490307| Alpha-amylase 1 precursor (EC 3.2.1.1) (1,4-... | e-159 | 56
pirnref|NF00490293| alpha-amylase (EC 3.2.1.1) precursor [Debary... | e-158 | 55
pirnref|NF00490296| alpha-amylase [Debaryomyces occidentalis] | e-158 | 54
pirnref|NF00155569| alpha-amylase [synthetic construct] | e-153 | 53


D. Query sequence = Hypocrea jecorina (Trichoderma reesei) exocellobiohydrolase 1
(CBH1, exoglucanase, cellobiohydrolases, 1,4-β-glucan cellobiohydrolase). The following
BLASTP hits with at least 50% sequence identity are all annotated as exocellobiohydrolase
enzymes (except those noted as hypothetical or unnamed products):


Abstract | E-value | % Identity
pirnref|NF00769949| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | 0.0 | 100
pirnref|NF01042178| cellulose 1,4-beta-cellobiosidase (EC 3.2.1. ... | 0.0 | 100
pirnref|NF00494383| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | 0.0 | 100
pirnref|NF01470257| cellobiohydrolase I [Trichoderma viride] | 0.0 | 99
pirnref|NF00756631| Cellobiohydrolase I [Trichoderma viride] | 0.0 | 95
pirnref|NF00756635| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | 0.0 | 94
pirnref|NF00494360| Cellobiohydrolase I [Hypocrea jecorina] | 0.0 | 100
pirnref|NF00494368| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc... | 0.0 | 99
pirnref|NF00494367| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc... | 0.0 | 99
pirnref|NF00494366| 1,4-Beta-D-Glucan Cellobiohydrolase I [Hypoc... | 0.0 | 99
pirnref|[illegible]| Exocellobiohydrolase I [Hypocrea jecorina] | 0.0 | 99
pirnref|[illegible]| 1,4-Beta-D-Glucan Cellobiohydrolase Cel7 [Hy... | 0.0 | 98
pirnref|[illegible]| Exocellobiohydrolase I [Hypocrea jecorina] | 0.0 | [illegible]
pirnref|[illegible]| Cellobiohydrolase [Hypocrea lixii] | 0.0 | [illegible]
pirnref|[illegible]| unnamed protein product [Acremonium thermoph... | 0.0 | 63
pirnref|[illegible]| unnamed protein product [Chaetomidium pingtu... | 0.0 | 62
pirnref|[illegible]| Xylanase/cellobiohydrolase precursor (EC 3.2... | 0.0 | [illegible]
pirnref|[illegible]| unnamed protein product [Exidia glandulosa] | 0.0 | [illegible]
pirnref|[illegible]| 1,4-beta-D-glucan cellobiohydrolase B precur... | 0.0 | [illegible]
pirnref|[illegible]| unnamed protein product [Chaetomium thermoph... | 0.0 | 58
pirnref|NF01489984| Hypothetical protein [Neurospora crassa] | 0.0 | 59
pirnref|NF01258404| unnamed protein product [Scytalidium thermop... | 0.0 | 57
pirnref|NF00992461| 1,4-beta-D-glucan-cellobiohydrolyase (EC 3.2... | 0.0 | 58
pirnref|NF00756327| Cellulase (EC 3.2.1.91) [Humicola grisea] | 0.0 | 57
pirnref|NF01286453| unnamed protein product [Thermoascus auranti... | 0.0 | 65
pirnref|NF00756321| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | 0.0 | 57
pirnref|NF00801194| Cellulase CEL7A [Lentinula edodes] | 0.0 | 60
pirnref|NF01257476| unnamed protein product [Thielavia australie... | 0.0 | 56
pirnref|[illegible]| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | 0.0 | 58
pirnref|[illegible]| [illegible] | e-180 | 64
pirnref|[illegible]| Cellobiohydrolase I [Thermoascus aurantiacus] | [illegible] | [illegible]
pirnref|NF00959024| Cellobiohydrolase 1 catalytic domain (EC 3.2... | [illegible] | [illegible]
pirnref|[illegible]| Putative exoglucanase type C prec... | e-179 | 57
pirnref|NF01053514| Cellobiohydrolase C [Aspergillus oryzae] | e-179 | 63
pirnref|NF00755639| Putative exoglucanase type C precursor (EC 3... | e-179 | 57
pirnref|NF00626413| 1,4-beta-D-glucan cellobiohydrolase A precur... | e-177 | 65
pirnref|NF00627000| Exoglucanase I precursor (EC 3.2.1.91) (Exoc... | e-177 | 57
pirnref|NF01696049| cellobiohydrolase C [Gibberella zeae] | e-176 | 57
pirnref|NF01288914| unnamed protein product [Exidia glandulosa] | e-176 | 63
pirnref|NF00646477| Exoglucanase 1 precursor (EC 3.2.1.91) (Exoc... | e-174 | 62
pirnref|NF01581876| hypothetical protein MG06834.4 [Magnaporthe ... | e-174 | 62
pirnref|NF00648798| Exoglucanase 1 precursor (EC 3.2.1.91) (Exoc... | e-170 | 56
pirnref|NF01265188| unnamed protein product [Trichophaea saccata] | e-169 | 60
pirnref|NF00992462| 1,4-beta-D-glucan-cellobiohydrolyase (EC 3.2... | e-168 | 60
pirnref|NF01053511| Cellobiohydrolase D [Aspergillus oryzae] | e-163 | 60
pirnref|NF00731508| cellulose 1,4-beta-cellobiosidase (EC 3.2.1. ... | e-163 | 54
pirnref|NF00731784| Cellulase precursor [Irpex lacteus] | e-162 | 52
pirnref|NF00733334| Exoglucanase precursor (EC 3.2.1.91) (Exocel... | e-162 | 54
pirnref|NF00731509| cellulose 1,4-beta-cellobiosidase (EC 3.2.1. ... | e-161 | 54
pirnref|NF00731785| Exocellulase precursor [Irpex lacteus] | e-160 | 53


E. Query sequence = Hypocrea jecorina (Trichoderma reesei) endoglucanase I (EG1,
endo-1,4-β-glucanase, 1,4-β-glucan glucanhydrolase). The following BLASTP hits with at
least 50% sequence identity are all annotated as endoglucanase enzymes (except those noted
as hypothetical or unnamed products):


Abstract | E-value | % Identity
pirnref|NF00494331| Endoglucanase EG-1 precursor (EC 3.2.1.4) (E... | 0.0 | 100
pirnref|NF01407727| Endoglucanase I [Trichoderma viride] | 0.0 | 99
pirnref|NF00756647| Endoglucanase EG-1 precursor (EC 3.2.1.4) (E... | 0.0 | 94
pirnref|NF00756639| Endoglucanase I [Trichoderma viride] | 0.0 | 93
pirnref|NF00154649| ENDO II [synthetic construct] | 0.0 | 97
pirnref|NF00494347| Endoglucanase 1 [Hypocrea jecorina] | 0.0 | 100
pirnref|NF00793302| unnamed protein product [Talaromyces emersonii] | e-121 | 55
pirnref|NF00626671| Endo-1,4-beta-glucanase (EC 3.2.1.4) [Asperg... | e-107 | 51


F. Query sequence = Coprinus cinereus laccase (polyphenoloxidase, bilirubin oxidase, 
multicopper oxidase). The following BLASTP hits with at least 50% sequence identity are all 
annotated as laccase enzymes (except those noted as hypothetical or unnamed products): 


Abstract | E-value | % Identity
pirnref|NF00733435| Laccase 2 (EC 1.10.3.2) [Coprinopsis cinerea] | 0.0 | 100
pirnref|NF00733482| Laccase 3 (EC 1.10.3.2) [Coprinopsis cinerea] | 0.0 | 78
pirnref|NF01638355| laccase 3 [Coprinopsis cinerea] | 0.0 | 78
pirnref|NF01386173| Laccase 4 (EC 1.10.3.2) [Pleurotus sajor-caju] | 0.0 | 64
pirnref|NF00731916| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi... | 0.0 | 64
pirnref|NF01386176| Laccase 5 (EC 1.10.3.2) [Pleurotus sajor-caju] | 0.0 | 66
pirnref|NF00731901| Bilirubin oxidase (Laccase) [Pleurotus ostre... | 0.0 | 64
pirnref|NF01567552| laccase [Pleurotus ostreatus] | 0.0 | 64
pirnref|NF01386172| Laccase 2 (EC 1.10.3.2) [Pleurotus sajor-caju] | 0.0 | 67
pirnref|NF01461741| laccase [Rigidoporus microporus] | 0.0 | 65
pirnref|NF01461740| laccase [Rigidoporus microporus] | 0.0 | 65
pirnref|NF01386175| Laccase 1 (EC 1.10.3.2) [Pleurotus sajor-caju] | 0.0 | 62
pirnref|NF00731925| Laccase 1 precursor (EC 1.10.3.2) (Benzenedi... | 0.0 | 62
pirnref|NF01637249| laccase [Pleurotus ostreatus] [Pleurotus pul... | 0.0 | 62
pirnref|NF00427931| Laccase precursor [Funalia trogii] | 0.0 | 63
pirnref|NF00758114| Laccase precursor (EC 1.10.3.2) [basidiomyce... | 0.0 | 63
pirnref|NF00939119| Polyphenoloxidase (EC 1.10.3.2) (Laccase 1) ... | 0.0 | 63
pirnref|NF00232919| Polyphenoloxidase (EC 1.10.3.2) (Laccase 1) ... | 0.0 | 63
pirnref|NF01689789| laccase [Pleurotus ostreatus] | 0.0 | 62
pirnref|NF00993008| Laccase 2 (EC 1.10.3.2) [Trametes pubescens] | 0.0 | 64
pirnref|NF00732955| Laccase (EC 1.10.3.2) [Schizophyllum commune] | 0.0 | 63
pirnref|NF00731988| Laccase precursor (EC 1.10.3.2) (Benzenediol... | 0.0 | 63
pirnref|NF00731957| laccase (EC 1.10.3.2) A [Trametes versicolor] | 0.0 | 63
pirnref|NF00731968| ligninolytic phenoloxidase (EC 1.10.-.-) 2 d... | 0.0 | 63
pirnref|NF01057965| Laccase 2 [Trametes versicolor] | 0.0 | 64
pirnref|NF00044343| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi... | 0.0 | 64
pirnref|NF00731635| Laccase precursor (EC 1.10.3.2) (Benzenediol... | 0.0 | 63
pirnref|NF00059532| Laccase precursor (EC 1.10.3.2) [Coriolus ve... | 0.0 | 63
pirnref|NF00731977| laccase I [Trametes versicolor] | 0.0 | 64
pirnref|NF00788162| Laccase [Pycnoporus coccineus] | 0.0 | 63
pirnref|NF00731989| ligninolytic phenoloxidase [Trametes hirsuta] | 0.0 | 63
pirnref|NF00788163| Laccase precursor [Pycnoporus coccineus] | 0.0 | 63
pirnref|NF00945391| unnamed protein product [unidentified] | 0.0 | 64
pirnref|NF00909761| Laccase (Laccase 1) [Lentinula edodes] | 0.0 | 64
pirnref|NF00044346| Laccase 1 precursor (EC 1.10.3.2) (Benzenedi... | 0.0 | 63
pirnref|NF00731959| Laccase 2 precursor (EC 1.10.3.2) (Benzenedi... | 0.0 | 64
pirnref|NF00964375| Laccase III (EC 1.10.3.2) [Trametes versicolor] | 0.0 | 63
pirnref|NF00731973| Laccase precursor (EC 1.10.3.2) [Trametes ve... | 0.0 | 63
pirnref|NF00801188| Laccase B precursor (EC 1.10.3.2) [Trametes ... | 0.0 | 63
pirnref|NF01012946| Laccase [Trametes versicolor] | 0.0 | 64
pirnref|NF00050826| Laccase LCC3-1 (EC 1.10.3.1) [Polyporus cili... | 0.0 | 63
pirnref|NF00427925| Laccase (EC 1.10.3.2) [Coriolopsis gallica] | 0.0 | 63
pirnref|NF01470431| laccase [Trametes sp. I-62] | 0.0 | 63
pirnref|NF01470433| laccase [Trametes sp. I-62] | 0.0 | 63
pirnref|NF00466979| Phenoloxidase (EC 1.10.3.2) [Trametes sp. I-62] | 0.0 | 62
pirnref|NF01470432| laccase [Trametes sp. I-62] | 0.0 | 62
pirnref|NF00466977| Phenoloxidase (EC 1.10.3.2) [Trametes sp. I-62] | 0.0 | 62
pirnref|NF01470430| laccase [Trametes sp. I-62] | 0.0 | 62
pirnref|NF00801189| Laccase 1 (EC 1.10.3.2) [Trametes versicolor] | 0.0 | 63
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G. Query sequence = Thermomyces lanuginosus (Humicola lanuginose) lipase. The 
following BLASTP hits with at least 50% sequence identity are all annotated as lipase enzymes 
(except those noted as hypothetical or unnamed products): 



Abstract 


E-value 


% Identity 


l oirnreflNF00756570l LiDase precursor (EC 3.1.1.3) (Triacvlqlvcer... 


e-171 I 


100 


DirnreflNF00756566l LIPASE IThermomvces lanuqinosusl 1 


e-159 | 


100 


DirnreflNF00756565l Lipase (E.C. 3.1.1.3HTriacvlalvcerol Acvlh... I 


e-159 | 


100 


DirnreflNFO1 1881781 Lipase TThermomvces lanuainosusl 


e-158 j 


99 


nirnr^flMFO1 1 Ci9dRR\ nnnflmpfl nrntpin nroduct ITalaromvces 


e-153 


88 


thermoD... 


DirnreflNFO1 11 16331 unnamed protein product rThermomvces 


e-136 


78 


ibadane... 


DirnreflNFO1 1141921 unnamed protein product [Talaromvces 


5e-97 


61 


emersoniil 


oirnreflNFO1 1057061 unnamed protein product rTalaromvces I 


2e-89 


57 


bvssoch... I 


oirnref|NF00626823| unnamed protein product [Aspergillus tubinqe... | 


8e-77 


50 


oirnreflNFOOl 583071 unnamed protein product funidentifiedl 


1e-76 


50 


H. Query sequence = Fusarium venenatum phospholipase B (lysophospholipase). The
following BLASTP hits with at least 50% sequence identity are all annotated as phospholipase
B/lysophospholipase enzymes (except those noted as hypothetical or unnamed products):

Abstract                                                              E-value   % Identity
pirnref|NF01706929| hypothetical protein FG03875.1 [Gibberella z...   0.0       89
pirnref|NF01527853| lysophospholipase (lpl) [Neurospora crassa]       0.0       57
pirnref|NF01488846| Hypothetical protein ((AF045574) lysophospho...   0.0       58
pirnref|NF00648138| lysophospholipase [Neurospora crassa]             0.0       58
pirnref|NF00649056| Lysophospholipase precursor (EC 3.1.1.5) (Ph...   0.0       58
pirnref|NF00626656| unnamed protein product [Aspergillus oryzae]      0.0       53
pirnref|NF01671055| unnamed protein product [Aspergillus niger]       0.0       55
pirnref|NF01470439| lysophospholipase [Aspergillus fumigatus]         0.0       52
pirnref|NF00626635| unnamed protein product [Aspergillus oryzae]      0.0       52
pirnref|NF00626497| unnamed protein product [Aspergillus niger]       0.0       50
pirnref|NF01671048| unnamed protein product [Aspergillus niger]       0.0       55
pirnref|NF00626475| unnamed protein product [Aspergillus niger]       0.0       55
pirnref|NF01573873| hypothetical protein MG07287.4 [Magnaporthe ...   e-180     51
pirnref|NF01470438| lysophospholipase [Aspergillus fumigatus]         e-180     53
pirnref|NF00626946| Lysophospholipase precursor (EC 3.1.1.5) (Ph...   e-179     52

Measuring in a quantitative, statistical sense the degree to which
structural and functional information can be "transferred" between pairs of
related protein sequences at various levels of similarity is an essential
prerequisite for robust genome annotation. To this end, we performed
pairwise sequence, structure and function comparisons on ~30,000 pairs
of protein domains with known structure and function. Our domain
pairs, which are constructed according to the SCOP fold classification,
range in similarity from just sharing a fold, to being nearly identical. Our
results show that traditional scores for sequence and structure similarity
have the same basic exponential relationship as observed previously,
with structural divergence, measured in RMS, being exponentially related
to sequence divergence, measured in percent identity. However, as the
scale of our survey is much larger than any previous investigations, our
results have greater statistical weight and precision. We have been able
to express the relationship of sequence and structure similarity using
more "modern scores," such as Smith-Waterman alignment scores and
probabilistic P-values for both sequence and structure comparison. These
modern scores address some of the problems with traditional scores,
such as determining a conserved core and correcting for length
dependency; they enable us to phrase the sequence-structure relationship in
more precise and accurate terms. We found that the basic exponential
sequence-structure relationship is very general: the same essential
relationship is found in the different secondary-structure classes and is
evident in all the scoring schemes. To relate function to sequence and
structure we assigned various levels of functional similarity to the
domain pairs, based on a simple functional classification scheme. This
scheme was constructed by combining and augmenting annotations in
the enzyme and fly functional classifications and comparing subsets of
these to the Escherichia coli and yeast classifications. We found sigmoidal
relationships between similarity in function and sequence, with clear
thresholds for different levels of functional conservation. For pairs of
domains that share the same fold, precise function appears to be
conserved down to ~40% sequence identity, whereas broad functional class
is conserved to ~25%. Interestingly, percent identity is more effective at
quantifying functional conservation than the more modern scores (e.g.
P-values). Results of all the pairwise comparisons and our combined
functional classification scheme for protein structures can be accessed from a
web database at http://bioinfo.mbb.yale.edu/align

Introduction

The problem of genome annotation 

Perhaps the most valuable information to be 
gained from a genome analysis is functional anno- 
tation of all the gene products. Unfortunately, of 
all the proteins whose sequences are known, func- 
tions have been experimentally determined for 
only a very small number (Andrade & Sander, 1997).
Given the current size and accessibility of
sequence and structure data, homologs of a newly
sequenced gene's product can be identified via
database searches, and probable structure and
function assigned to the gene product (Bork et al.,
1998). This is based on the concept that sequence
similarity implies structural and functional simi- 
larity. However, structural and functional annota- 
tions should be transferred with caution. If a 
protein is assigned an incorrect function in a data- 
base, the error could carry over to other proteins 
for which structure or function is inferred by hom- 
ology to the errant protein (Brenner, 1999; Karp, 
1996, 1998a). In large databases such an error can 
propagate out of control, presenting a serious qual- 
ity control issue as we move to larger genomes 
from multicellular organisms. 

Benchmarking fold and function recognition 

Here, we used manually curated structural and 
functional classifications as standards in analyzing 
to what degree annotations of a protein's structure 
and function can be transferred to a similar 
sequence. The knowledge gained from the study 
can be used to establish confidence levels for struc- 
ture and function prediction, improving our under- 
standing of how long it will take to annotate 
accurately an entire genome. 

Our simultaneous analysis of relationships 
between sequence and structure, sequence and 
function, and structure and function (Figure 1) 
may provide insight into paradigms for functional 
prediction other than that based alone on sequence 
similarity (Enright et al., 1999).

Past results 

Sequence-structure 

The transfer of structural annotation is well 
characterized. Chothia & Lesk (1986, 1987) found 
that structural divergence, when expressed in 
terms of the RMS separation of matching alpha 
carbon atoms, was an exponential function of 
sequence divergence, expressed in terms of the 
fraction of residues that differed between 
sequences. The reliability of structural annotation 
transferred by homology, then, depends on the 
sequence identity of the homologous proteins 
(Chothia & Lesk, 1986). Flores et al. (1993), Russell
& Barton (1994), and Russell et al. (1997) observed
the same general trend, and also characterized the
conservation of structural features other than the
Cα backbone, such as secondary structure, accessibility
and torsion angles. A paper by Wood &
Pearson (1999) re-expressed the sequence-structure 
relationship in terms of statistically based "Z- 
scores" and found that this relationship had a 
simple linear form in terms of these scores. They 
also noted that protein families differed in detail in 
the slope of this linear relationship. 

Others have focused on the limits of sequence 
comparison, specifically around the "twilight 
zone," the region of sequence similarity that does 
not reliably imply structural homology (Doolittle, 
1987), and on establishing cut-offs for significant 
sequence similarity. Using the SCOP structural 
classification (Murzin et al., 1995), Brenner et al.
(1998) benchmarked the effectiveness of the popu- 
lar FASTA and BLASTP programs and their prob- 
abilistic scoring schemes (i.e. the e-value) (Pearson 
& Lipman, 1988; Pearson, 1996; Altschul et al.,
1990, 1994; Karlin & Altschul, 1993). They found 
that in making fold assignments, the FASTA 
e-value closely tracked the number of false posi- 
tives, i.e. the error rate, and that at a conservative 
e-value cut-off of 0.001, the FASTA program could 
detect nearly all the relationships that would be 
detected by a full Smith-Waterman comparison 
(Smith & Waterman, 1981). Specifically, they found 
that FASTA with a 0.001 threshold would find 
16% more of the structural relationships in SCOP 
than would be found by standard sequence com- 
parison with a 40% identity threshold. This rigor- 
ous benchmarking approach has been extended to 
assess transitive sequence comparison, through a 
third intermediate sequence and multiple-sequence 
matching programs such as PSI-BLAST (Park et al.,
1997, 1998; Gerstein, 1998a; Salamov et al., 1999). In
a related study Rost (1999) worked on characteriz- 
ing the region after the twilight zone, which he 
called the "midnight zone". In a sense these bench- 
marking studies have culminated in the CASP fold 
recognition experiments (Moult et al., 1997;
Sternberg et al., 1999).



Sequence-function 

Although the exact dependence of functional
similarity on sequence and structural similarity is
not completely clear, initial indications of a gene
product's function are most often based on simple
sequence similarity (Bork et al., 1994, 1998). Often
these are merely based on the best hit in database
comparisons; see, for example, the annotation of
some of the early genomes (Fraser et al., 1995,
1998). However, possibilities for more robust annotation
transfer are increasingly available. One approach looks
at the pattern of hits amongst different phylogenetic
groups (Tatusov et al., 1997). Others
focus on the existence of key motifs and patterns
associated with function (Zhang et al., 1998; Bork &
Koonin, 1996; Attwood et al., 1999).

Figure 1. This Figure schematically depicts certain aspects of our comparison methodology. (a) The paradigm relating
sequence to structure to function. There has not been as much assessment of functional annotation transfer based
on structure as there has been with sequence-based structural and functional annotation transfer. (b) How we
conceptualized our analysis in terms of pairs. A few examples of SCOP domains (identified on the left and bottom) are
included from our comparison. In the Figure the shape represents fold, and the pattern represents function. We have
highlighted some example categories of pairs: a pair that shares fold and function, a pair that shares fold but not
function and a pair that shares neither fold nor function. The latter category of pairs is not considered in our
investigation; we looked only at paired domains with the same fold. In constructing our pairs, we used only a
representative set of SCOP domains. This is illustrated in the Figure by the domains flagged with asterisks. Note, in particular,
that the SCOP domain d4tima_ is not paired with anything because it is represented by d5tima_, which is the same
species and protein. For each level of pairs (fold, superfamily, family), cluster representatives were chosen for the
level below: (i) for family pairs, one representative was selected from each species/protein, the level below, and then
paired with all the other representatives within its family; (ii) for superfamily pairs, one representative was chosen
from each family, unless there were domains in the family that shared less than 40% sequence identity, in which case
additional representatives were included, each not more than 40% identical with the other representatives from the
family (this occurs, for instance, for the globins); and (iii) likewise for fold pairs, one representative was chosen from
each superfamily, more if there were domains with less than 40% sequence identity. (c) Subdivides the pairs into the
four SCOP classes from which they were composed: (i) all-α, domains consisting of α-helices; (ii) all-β, domains
consisting of β-sheets; (iii) α/β, domains with integrated α-helices and β-strands; and (iv) α + β, domains with segregated
α-helices and β-strands. We initially set apart the immunoglobulins from the rest of the all-β pairs because we
realized that their large number biases our data. However, we compared the results for the immunoglobulin pairs to all
other pairs and found that they generally exhibit the same behavior as the other pairs. Therefore we decided to leave
them in the comparison.



Sequence-structure-function 

One way that the better-defined sequence-struc- 
ture relationship can assist in function prediction is 
initially to predict the structure of an uncharacter- 
ized sequence and then predict the function based 
on the limited repertoire of functions known to 
occur with that structure. To some degree this was 
achieved by Fetrow and co-workers (Fetrow et al.,
1998; Fetrow & Skolnick, 1998). They predicted 
structural profiles based on threading and ab initio 
methods, and then searched with these against 
profiles of known structures in order to predict 
function. 

In related work, Russell et al. (1998) discussed 
using identification of structural binding sites in
predicting protein function. In a comprehensive
study, Hegyi & Gerstein (1999) investigated to 
what degree folds were associated with functions. 
They found that most folds were associated with 
one or two functions with the exception of a few 
special folds, such as the TIM barrel, that could 
carry out numerous functions. Furthermore, they 
found that particular folds were often confined to 
distinct phylogenetic groups, an additional fact 
that can feed into an integrated sequence-structure- 
function analysis (Gerstein & Hegyi, 1998; 
Gerstein, 1997, 1998b,c). 

Here, we look at pairwise comparisons of
protein sequence, structure and function among
proteins that share the same fold. We assess the
trends relating sequence, structure and function
and consider the implications for structural and
functional annotation transfer.

New developments: probabilistic scoring and 
growth of the databank 

The past studies regarding sequence, structure 
and function relationships often used RMS separ- 
ation and percent sequence identity (or a linear 
variant of it, such as the fraction of mutated resi- 
dues) to express similarities in structure and in 
sequence, respectively. However, it has become 
increasingly common to use probabilistic scoring 
schemes (P-values) to express the quality of a 
match in terms of statistical significance rather 
than an arbitrary raw score such as percent identity
(Pearson, 1998; Karlin & Altschul, 1990, 1993;
Karlin et al., 1991; Altschul et al., 1994; Bryant &
Altschul, 1995; Abagyan & Batalov, 1997). With
P-values, scores from different investigations can 
be compared in a common framework. Recently, it 
was found that sequence and structure similarity 
significance can be expressed as P-values in the 
same unified statistical framework (Levitt & 
Gerstein, 1998). Here, we use such probabilistic 
scoring methods to overcome the limitations of the 
more traditional scores. 

Another recent development is the tremendous 
growth in the number of solved structures. The 
RCSB Protein Data Bank (Bernstein et al., 1977) now
contains more than 10,000 protein structures. These 
structures are broken into more than 18,000 
domains, and then domains that share a fold are 
paired up with each other for comparison 
(Figure 1(b)). Here, we survey ~30,000 pairs of
protein domains that are known to have the same 
fold, approximately 1000 times the number com- 
pared by Chothia & Lesk (1986). The large scale of 
this comparison affords greater statistical weight to 
the results. 

Alignment of 30,000 pairs from SCOP 

The basic unit of comparison: a pair of 
protein domains 

The protein domains that we studied were classified
by SCOP, a Structural Classification of Proteins
(Murzin et al., 1995; Brenner et al., 1996;
Hubbard et al., 1997), a hierarchy of five levels:

(i) class, domains that have the same secondary
structural content (all-α, all-β, α/β, or α + β);

(ii) fold, domains that geometrically share the same
tertiary fold;

(iii) superfamily, domains descended
from the same ancestor (but which lack measurable
sequence similarity);

(iv) family, domains in the
same protein sequence family (which have appreciable
sequence similarity); and

(v) species and protein.

Pairs of protein domains that are grouped 
together at the fold, superfamily or family level 
form the basic unit of our comparisons. 



Selection of pairs 

There is potentially a huge number of pairs of 
domains that can be constructed out of the 
relationships in SCOP. For instance, in the current 
version of SCOP there are ~3.9 million potential 
pairs between domains sharing the same fold. 
Most of these are between nearly identical struc- 
tures. In order to keep the number of pairs man- 
ageable, we used a straightforward clustering 
scheme, described in the legend to Figure 1. We 
selected 29,454 representative pairs from the total 
in SCOP. To achieve a wide range of similarities, 
we constructed the pairs on three levels of the 
SCOP hierarchy: (i) family pairs, 19,542 pairs of 
domains in the same family; (ii) superfamily pairs, 
4220 pairs of domains in the same superfamily 
but different families; and (iii) fold pairs, 5692 
pairs of domains in the same fold but different 
superfamilies. 
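
As a quick arithmetic check on the figures quoted above, the three per-level counts can be summed and compared with the stated total of 29,454 representative pairs. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Pair counts per SCOP level, as quoted in the text.
pair_counts = {"family": 19542, "superfamily": 4220, "fold": 5692}

# Total number of representative pairs.
total = sum(pair_counts.values())

# Share of each level in the total, in percent (one decimal place).
shares = {level: round(100.0 * n / total, 1) for level, n in pair_counts.items()}

print(total)
print(shares)
```

The sum reproduces the stated total, and roughly two thirds of the pairs come from the family level.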

All the selected domains were at least 50 residues
in length and were drawn from the four
major SCOP secondary-structural classes: all-α, all-β,
α/β, and α + β (Figure 1(c)).

We automatically aligned each of our selected 
domain pairs twice, once by global Needleman- 
Wunsch sequence comparison (Needleman & 
Wunsch, 1971; Myers & Miller, 1998) and then 
by structure (Gerstein & Levitt, 1996, 1998), cal- 
culating scores for sequence and structural simi- 
larity. 
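
The global sequence alignment step named above can be illustrated with a minimal dynamic-programming sketch. It is not the linear-space implementation cited in the text: it assumes a simple match/mismatch score in place of a substitution matrix and a linear gap penalty, and the function names and default values are hypothetical choices made only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Scoring values are illustrative assumptions, not those of the study.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the aligned (non-gap) positions."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


result = needleman_wunsch("ACGT", "AGT")
print(result, percent_identity(result[1], result[2]))
```

Note that in this sketch percent identity is computed over matched positions only, so the gapped column does not lower it.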

Web-accessible database 

The results of all the pairwise comparisons are 
available via a searchable database on the web at 
http://bioinfo.mbb.yale.edu/align. The query
engine allows searches of individual SCOP pairs, 
all pairs that include a given SCOP domain, or all 
pairs containing any SCOP domain contained in a 
given PDB entry. 

Traditional scores: RMS and percent identity 

The sequence-structure relation, as expressed by 
the root-mean-square (RMS) of the aligned Cα distances
and percent sequence identity, has been previously
characterized as an exponential function by
Chothia & Lesk (1986) and others (Flores et al.,
1993; Russell & Barton, 1994; Russell et al., 1997).
As Figure 2 illustrates, our data display a similar 
trend. (Exact equations are given in the legend to 
Figure 2.) However, we have one thousand times 
as many data points as in Chothia and Lesk's orig- 
inal study (30,000 as opposed to 30). 

The main difference between our results and 
the previous studies is due to differences in 
RMS "trimming" methods. By trimming we refer 
to the process of removing the worst-fitting 
aligned atoms from the RMS calculation, to 
arrive at a structural "core." This was first 
developed in Lesk's sieve-fit procedure (Lesk & 
Chothia, 1984) and has been refined in numerous
studies (e.g. Gerstein & Altman (1995)). This
is done because the small distances between 
well-matched alpha carbon atoms have much 
less of an effect on the RMS than do the very 
large distances between poorly matched atoms. 
The untrimmed score of divergent protein 
domains is then concerned primarily with the 
poorly matched residues instead of the con- 
served core. Trimming alleviates this effect by 
restricting the RMS calculation to include only 
those residues believed to be in the conserved 
core. However, the degree of trimming is to 
some extent arbitrary, and this choice affects the 
baseline of the reported RMS scores. Here we 
considered only the better half (50%) of matched 
residues in a given pair of protein domains. 
Chothia & Lesk (1986) chose a somewhat different
threshold. Figure 2(c) and (d) demonstrate
the effect of trimming. 
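
The effect of trimming can be made concrete with a small sketch that compares the RMS over all matched Cα distances with the RMS over the better-matching half. The distances below are invented for illustration, and the sketch assumes they were measured after a single superposition; a full trimming procedure may refit the structures on the retained core, which is not shown here.

```python
import math


def rms(distances):
    """Root-mean-square of a list of inter-atomic distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def trimmed_rms(distances, keep_fraction=0.5):
    """RMS over the best-matching fraction of distances (the 'core')."""
    n_keep = max(1, int(len(distances) * keep_fraction))
    return rms(sorted(distances)[:n_keep])


# Hypothetical matched C-alpha distances in angstroms: a tight core plus two outliers.
distances = [0.4, 0.5, 0.6, 0.7, 0.9, 1.2, 3.5, 8.0]

full = rms(distances)
core = trimmed_rms(distances)
print(round(full, 2), round(core, 2))
```

The two poorly matched pairs dominate the untrimmed value, while the trimmed value reflects only the well-matched half.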



Analogous alignment similarity scores: Smith-Waterman
score and structural comparison score

The dependence of the RMS separation on trim- 
ming method restricts its usefulness in comparing 
data. Likewise, there are many problems with 
using percent identity as a measure of sequence 
similarity. For instance, a match of non-identical 
but still similar residues (e.g. Arg versus Lys) scores 
the same as one between completely different resi- 
dues (e.g. Arg versus Val), and gaps do not enter in 
the score calculation. Consequently, we now turn 
to alignment similarity scores, which eliminate 
some of the problems with traditional scores. 
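
The limitation of percent identity noted above can be seen in a small sketch. The similarity values are toy numbers chosen only for illustration (they are not entries from any published substitution matrix), and the helper names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions with identical residues in two equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_sum(a, b, similarity, default=-1):
    """Sum of per-position similarity values for an ungapped alignment."""
    total = 0
    for x, y in zip(a, b):
        total += similarity.get((x, y), similarity.get((y, x), default))
    return total


# Toy similarity values (assumed): identical > conservative > radical substitution.
toy = {("R", "R"): 5, ("R", "K"): 2, ("R", "V"): -3}

reference = "RRRR"
conservative = "RRRK"  # Arg -> Lys at the last position
radical = "RRRV"       # Arg -> Val at the last position

print(percent_identity(reference, conservative), percent_identity(reference, radical))
print(similarity_sum(reference, conservative, toy), similarity_sum(reference, radical, toy))
```

Both substitutions give the same percent identity, whereas a similarity-weighted sum separates the conservative change from the radical one.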

For sequence alignments, an alignment score is 
defined as the sum of the similarity matrix values 
for the alignment, minus the total gap penalty. 
This is sometimes called the Smith-Waterman score 
(Smith & Waterman, 1981). An analogous align- 
ment score for structure is the structural compari- 
son score, described by Levitt & Gerstein (1998). 
We will refer to these two similarity scores as
S_seq and S_str, respectively. Note that they both increase
for more similar pairs, whereas RMS increases for
more divergent pairs. Specifically, S_str is the score
maximized by the structural alignment program
we used (Gerstein & Levitt, 1998). It can be calculated
from any pair of aligned structures according
to the function:

\[ S_{str} = M \left( \sum_{i} \frac{1}{1 + (d_i/d_0)^2} - \frac{N_{gap}}{2} \right) \tag{1} \]

M and d_0 are constants, usually set to 10 and 5 Å,
N_gap is the number of gaps in the alignment, d_i is
the distance between each aligned pair of Cα
atoms, and the sum is carried over all aligned
pairs, i.
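
A minimal sketch of how a score of this kind could be evaluated is given below. It assumes the form M times (the sum over aligned pairs of 1/(1 + (d_i/d_0)^2), minus N_gap/2), with the default constants taken from the values quoted in the text; the function name and example distances are hypothetical.

```python
def structural_score(distances, n_gap, M=10.0, d0=5.0):
    """Structural comparison score for one aligned pair of domains.

    distances: C-alpha to C-alpha distances (angstroms) for each aligned pair.
    n_gap:     number of gaps in the alignment.
    M, d0:     constants; defaults follow the values quoted in the text.
    """
    matched = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return M * (matched - n_gap / 2.0)


# A perfectly superimposed pair contributes 1 to the sum; a pair separated
# by d0 contributes 0.5; distant pairs contribute almost nothing.
print(structural_score([0.0, 0.0, 0.0], n_gap=0))
print(structural_score([5.0], n_gap=0))
print(structural_score([0.5, 1.0, 12.0], n_gap=1))
```

Because each term is bounded by 1, a badly matched pair cannot dominate the total, which is why no trimming step is needed; the total also grows with the number of aligned pairs, so the score depends on alignment length.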



The main advantage of S_str over RMS in describing
structural similarity is that the Cα to Cα
distance, d_i, appears in the denominator of the calculation.
This means that the smallest distances,
corresponding to the best matches in the conserved
core, are most significant in determining the score.
Hence, the need for trimming is eliminated. S_str is
also advantageous because it takes gaps into
account and because of the fundamental analogy
between this score and S_seq.

Figure 3(a) displays the relationship between
structural and sequence similarity as expressed by
S_str and S_seq. Figure 3(c) and (d) show calibration
curves relating each of these scores back to
approximate RMS separation and percent identity,
respectively. Calibration curves help one get an
intuitive feel for the degree of relationship in terms
of the more traditional scores. Figure 3(b) adds a
third axis, alignment length, and demonstrates that
S_str depends greatly on this quantity. Although S_str
and S_seq are "better" scores than RMS and percent
sequence identity, the heavy dependence of both of
these on length limits their usefulness in many
situations. In other words, two pairs of similar
domains with equal percent sequence identities but
different lengths can have drastically different
scores.

Probabilistic scores: P-values expressing the
significance of sequence and 
structure similarity 

Probabilistic scores can, to a great degree, over- 
come the length-dependence problems associated 
with the alignment scores. Probabilistic measures 
are advantageous because they express similarity 
not by an arbitrary "score" but by a statistical sig- 
nificance: the likelihood that such a similarity 
could be achieved by chance. This likelihood is 
also called the "P-value." We used calculations 
(described in detail in the legend to Figure 4) 
based on those given by Levitt & Gerstein (1998) to 
obtain P-values based directly on S_str and S_seq; we
refer to these calculated P-values as P_str and P_seq,
respectively. For P_seq we could equally well have
used the numbers from one of the popular
sequence search programs (i.e. BLAST or FASTA)
as all these values have been shown to be perfectly
proportional to each other (Levitt & Gerstein, 1998;
Brenner et al., 1998).

P_seq and P_str can be used to express the relationship
between structure and sequence similarity on
a more fundamental level. Figure 4(a) shows a log-log
(base 10) plot of P_str against P_seq. Because it is
log-log, trends can be visualized as straight lines.
Two straight lines are necessary to fit the points
well, with the discontinuous boundary between
the lines located at the beginning of the twilight
zone. The different slope of the line at low
sequence similarity reveals that in the twilight
zone there is a different relationship between the
significance of structural similarity and that of
sequence similarity. In particular, for domain pairs

Figure 2. RMS as a function of percent identity. (a) A simple scatter plot of our pairs, relating RMS separation to
percent sequence identity. This is similar to the presentation given by Chothia & Lesk (1986), but in this survey we
looked at 30,000 pairs, 1000 times the number they compared. Outliers (pairs with RMS scores further than two standard
deviations from the mean for their percent identity) are excluded from this graph; they represent domains that
are very closely related with the exception of a conformational change. (b) A simplified graph with a number of fits
to the data. For each percent identity bin we show the median RMS value, indicated by (♦) and the top and bottom
quartile RMS values, indicated by the bars. Two fits are drawn through the median RMS values. The thin line,
labeled SINGLE, is a simple exponential fit through the medians. It has the form:

\[ R = 0.21\, e^{0.0132 H} \]

where R is the RMS deviation after least-squares fitting, H is the percent difference between the sequences (H for
Hamming distance), and H = 100% - I, where I is the percent sequence identity. The thick line, labeled MULTI, is a
multigraph fit, which is described in the legend to Figure 4. The relation between RMS and percent identity according
to this fit is expressed by the equation:

\[ R = 0.18\, e^{0.0187 H} \]

The twilight zone of sequence identity and below is labeled TZ. In this region, sequence similarity is not significant
and not reliable for predicting structural similarity. This is why the median values in this area of the graph deviate
significantly from the fits, which consider only data above 20% sequence identity. For reference we include the original
data points from Chothia and Lesk's 1986 paper (A.M. Lesk, personal communication), indicated by X. Their
data follow the form:

\[ R = 0.40\, e^{0.0187 H} \]

The difference between the Chothia & Lesk trend and our relationship is due to the different trimming methods used
in calculating the RMS score. Chothia and Lesk imposed a 3 Å cut-off in determining the conserved core residues; we
defined the core as the better matching (in terms of Cα distances) half (50%) of the residue pairs. (c) and (d) The
effect our trimming has on median RMS values. The RMS values in (c) are calculated from all the matched residues
in each pair; the values in (d) are calculated from the better matching 50% of the residues.
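
The three exponential fits quoted in this legend can be evaluated side by side with a short sketch. It assumes coefficients of the form R = a·exp(b·H) with H = 100 − I in percent, as read from the legend; the dictionary keys and function name are illustrative, and the fits are stated to apply only above 20% sequence identity.

```python
import math

# (a, b) coefficients for R = a * exp(b * H), H in percent sequence difference.
FITS = {
    "SINGLE": (0.21, 0.0132),
    "MULTI": (0.18, 0.0187),
    "CHOTHIA_LESK_1986": (0.40, 0.0187),
}


def predicted_rms(percent_identity, fit="MULTI"):
    """Approximate RMS (angstroms) expected at a given percent identity."""
    a, b = FITS[fit]
    h = 100.0 - percent_identity  # percent difference between the sequences
    return a * math.exp(b * h)


for identity in (100, 80, 50, 30):
    row = {name: round(predicted_rms(identity, name), 2) for name in FITS}
    print(identity, row)
```

Because the MULTI and Chothia & Lesk fits share the same exponent, their ratio is constant (0.40/0.18) at every identity level, consistent with the legend's attribution of the offset to the different trimming methods.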



in the twilight zone (according to the percent identity
calibration in Figure 4(b)), structural
similarity is more significant than sequence simi- 
larity (having a smaller P-value or more negative 
log P-value). In contrast, for pairs with more than 
~30% identity, the situation is reversed, with a 
given pair having more significant sequence simi- 
larity than structural similarity. One possible 
interpretation of this reversal is as follows. Struc- 
ture is always more highly conserved than 
sequence, so usually a given amount of structural 
similarity is not as significant as a corresponding 
amount of sequence similarity. However, this is 
true only when meaningful sequence similarity
actually exists; thus, it does not apply in the twi-
light zone, where sequence similarity is by defi- 
nition not significant. Note that all pairs in our 
comparison share at least the same fold, implying 
that they always have a significant amount of 
structural similarity. 

In other words, for closely related sequences, 
differences in sequence similarity are more mean- 
ingful, whereas for highly diverged sequences that 
share the same fold, the differences in structural 
similarity are more significant. 

Fitting two lines to the P str versus P^ graph 
suggests that the same might be done for other 
scoring schemes. It is possible to some degree to fit 
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Figure 3. Similarity scores: structural comparison score as a function of Smith-Waterman score. Alignment simi- 
larity scores S str and have certain advantages over RMS and percent identity scores for expressing the sequence- 
structure relation. 5 str is calculated according to equation (1) in the text (Gerstein & Levitt, 1998; Levitt & Gerstein, 
1998). S^q is calculated using the BLOSUM50 matrix (Henikoff & Henikoff, 1992) with gap opening and extension 



penalties of -12 and -2, respectively, (a) This is analogous to (b) in Figure 2. From the original 30,000 pairs we show 
the median S str value for each bin, along with quartile bars above and below. Again the twilight zone and below 
is labeled T2. The thin line, marked SINGLE, is a simple fit to the median S str values in this graph; it has the form: 

S str = 2144 - 1106exp(-0.00544Sse q ) 

The thick fit, marked MULTI, is the multigraph fit, explained below. It follows the equation: 

S s tr = 2157 - 787exp(-0.0028S seq ) 

The equations presented here provide an approximation of the observed trends; as (b) illustrates, they are nothing 
more than simple approximations. The main disadvantage of S str as a measure of structural similarity is its heavy 
length dependency for pairs of structurally similar protein domains. (b) Surface plot of the median S str as a function 
of S seq and alignment length (the number of matched residue pairs). It is clear that the size of the aligned domains 
plays a major role in the resulting S str, even though our fits do not take length into account. (c) and (d) Relate S seq 
and S str to the more familiar percent identity and RMS measures. The fits were used to convert between scoring 
schemes in constructing the multigraph fit. We derived the multigraph fit in order to create one set of equations and 
parameters that would relate sequence and structural similarity using either the percent identity and RMS scheme or 
the S seq and S str scheme, and allow translation between them. We simultaneously performed least-squares fits to the 
median values in four graphs: Figures 2(b) and 3(a) and the calibrations of S seq to percent identity and S str to RMS, 
(c) and (d), respectively. In all cases, we ignored data in and below the sequence identity twilight zone (labeled TZ). 
The parameters in (a) are dependent on the parameters in Figure 2(b) via the mentioned calibrations. 



the traditional RMS versus percent identity graph 
(Figure 2) with two straight lines instead of an 
exponential curve. However, in this case, we opted 
for the more conventional presentation. 
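
As a rough illustration, the two exponential fits quoted in the legend to Figure 3 can be evaluated directly. The sketch below is in Python; the function names are ours and are not part of the original analysis, and the coefficients are simply those printed in the legend. Neither fit takes alignment length into account, so the values are only approximate medians.

```python
import math


def s_str_single(s_seq: float) -> float:
    """SINGLE fit (Figure 3 legend): median S_str predicted from S_seq."""
    return 2144.0 - 1106.0 * math.exp(-0.00544 * s_seq)


def s_str_multi(s_seq: float) -> float:
    """MULTI (multigraph) fit (Figure 3 legend): median S_str from S_seq."""
    return 2157.0 - 787.0 * math.exp(-0.0028 * s_seq)


if __name__ == "__main__":
    # Compare the two fits over a few illustrative (hypothetical) S_seq values.
    for s_seq in (0, 100, 500, 1000):
        print(s_seq, round(s_str_single(s_seq), 1), round(s_str_multi(s_seq), 1))
```

Both curves rise monotonically and saturate (at 2144 and 2157, respectively), which is the exponential behaviour described in the text.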

Class differences 

The division of SCOP into classes based on sec- 
ondary-structural composition allows easy investi- 
gation as to whether there are any deviations from 
the common similarity relationships on account of 
secondary-structure characteristics. Figure 5(a) 
reveals that secondary structural composition does 
not markedly affect the trends in sequence and 
structure similarities. This is consistent with the 



data given by Wood & Pearson (1999). However, 
the larger average length of α/β domains compared 
with domains in the other classes results in a 
deviation in the length-dependent S str (Figure 5(b)). 
The consistency among length-independent scores 
applies for certain individual folds as well. The 
immunoglobulin fold makes up an appreciable 
fraction of all the β-pairs (Figure 1(c)), yet the 
results are not affected if these pairs are left out. 

Linking sequence and structure to function 

Difficulties of functional comparison 

There is a clear, well-characterized relationship 
between sequence and structure similarity, which 
Figure 4. Probabilistic scores: P-values. P seq and P str are P-values calculated from S seq and S str according to the 
formalism given by Levitt & Gerstein (1998). Both quantities have the same overall functional form in terms of an 
extreme value distribution: 

P = 1 - exp(- exp(-Z)) 

where P is either P seq or P str. For P seq, Z = S seq/a - 2 ln M - b/a, where a = 5.84, b = -26.3, and M is the geometric 
mean of the lengths of the two sequences (i.e. M^2 = nm, where n and m are the two sequence lengths). For P str, Z is a 
function of S str and N, the number of matched residues. For N < 120: 

$$Z = \frac{S_{str} - c \ln^2 N - d \ln N - e}{f \ln N + g}$$

For N ≥ 120: 

$$Z = \frac{S_{str} - a \ln N - b}{f \ln 120 + g}$$

At N = 120, continuity implies that: 

$$a \ln 120 + b = c \ln^2 120 + d \ln 120 + e \quad \text{and} \quad a = 2c \ln 120 + d$$

This, in turn, allows the calculation of the constants: 

a = 171.8, b = -419.4, c = 18.4, d = -4.50, e = 2.64, f = 21.4, g = -37.5 

(a) of this Figure is analogous to Figures 3(a) and 2(b), with the exception of the fits. It is a log-log (base 10) plot 
relating P seq and P str. We show the median log(P str) value for each log(P seq) bin, along with quartile bars above and 
below. We have added approximate percent identity and RMS values to the x and y axes to aid interpretation of the 
graph in terms of more familiar scores. The values were calculated using the calibration curves in (b) and (c). The 
straight-line nature of the log-log plot reveals distinct relations inside and outside the twilight zone, labeled TZ. (The 
area of percent identity below the twilight zone does not appear in P seq graphs because there is no significance for such low 
sequence similarity; thus all data points in that zone appear at P seq = 1 or log[P seq] = 0.) The thick line in the figure is 
fit to the median P str values for P seq values outside the twilight zone; its equation is: 

p str = io- io p°4 5 

The thin line is fit to the data inside the twilight zone; it follows the relation: 

p str = io- 6 p°4 74 

For reference we include the dotted line, representing the function P str = P seq, where sequence and structural similarity 
are equally significant. See the text for a discussion of how the two trends might be interpreted with respect to 
this line. 
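
The P-value formalism in this legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the legend text is taken to mean Z = S seq/a - 2 ln M - b/a for sequence (a = 5.84, b = -26.3), and the two-branch form in N for structure, with the constants as printed. All function and variable names are ours, not those of Levitt & Gerstein (1998).

```python
import math

# Sequence constants (assumed reading of the legend: a = 5.84, b = -26.3).
A_SEQ, B_SEQ = 5.84, -26.3
# Structure constants as printed in the legend.
A, B, C, D, E, F, G = 171.8, -419.4, 18.4, -4.50, 2.64, 21.4, -37.5


def p_from_z(z: float) -> float:
    """Extreme-value P-value: P = 1 - exp(-exp(-Z))."""
    if z < -700.0:  # exp(-z) would overflow; P equals 1 to machine precision
        return 1.0
    # -expm1(-y) computes 1 - exp(-y) accurately when y is tiny.
    return -math.expm1(-math.exp(-z))


def z_seq(s_seq: float, n: int, m: int) -> float:
    """Z for a sequence score; M is the geometric mean of the two lengths."""
    big_m = math.sqrt(n * m)
    return s_seq / A_SEQ - 2.0 * math.log(big_m) - B_SEQ / A_SEQ


def z_str(s_str: float, n_matched: int) -> float:
    """Z for a structural score; the branch depends on matched residues N."""
    ln_n = math.log(n_matched)
    if n_matched < 120:
        return (s_str - C * ln_n ** 2 - D * ln_n - E) / (F * ln_n + G)
    return (s_str - A * ln_n - B) / (F * math.log(120) + G)


if __name__ == "__main__":
    # Hypothetical scores, chosen only to exercise the functions.
    print(p_from_z(z_seq(100.0, 100, 100)))
    print(p_from_z(z_str(500.0, 150)))
```

With the rounded constants, the two structural branches agree at N = 120 only approximately, which is to be expected from the continuity conditions stated above.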



can be used to transfer precisely structural annota- 
tion based on the degree of sequence homology. In 
genome analysis, however, one is usually more 
interested in finding a functional annotation for an 
open reading frame based on similarity to well- 



known proteins; yet the sequence-function and 
structure-function relationships have not been as 
explicitly characterized. The fundamental obstacle 
to extending this and similar investigations to deal 
with function is the absence of a clear measure of 
Figure 5. SCOP class differences. Previously it has 
been observed that secondary structural composition 
does not cause deviations from the trends in structure 
and sequence similarity (Flores et al, 1993). To test this 
observation we looked at the scores divided by SCOP 
class. The following legend applies to the graphs: 
( — ■ — ), all alpha; (- ♦ -), all beta; (--A--), alpha/beta; 
(- - x - -), alpha + beta. (a) Median RMS values for 
each percent identity bin. The traditional scores reveal 
no dependency on class. However, in (b) α/β pairs 
consistently score higher S str scores than pairs in other 
classes. This is a consequence of the dependence of S str 
on length; domains in the α/β class are longer, on 
average, than in the other classes. 



functional similarity. Although we were able to 
present three different quantitative measures of 
structural relatedness, an analogous situation for 
function does not exist. How can one express 
quantitatively the degree of similarity between a 
triosephosphate isomerase and a glucose-6-phosphate 
isomerase? How do they compare to trp 
repressor? 

The absence of a clear measure of functional 
similarity is not the only obstacle in transferring 
the functional annotations between proteins with 
different degrees of homology. The definition of 
function itself is often vague. More specifically, at 
present there is an absence of such important infor- 
mation as a standardized vocabulary for protein 
functional annotations with an associated number- 
ing scheme, descriptions of monomer functions of 
subunits of multisubunit proteins and hierarchical 
functional assignments for proteins with multiple 



functions. As a consequence of these difficulties 
there is no functional equivalent to the hierarchical 
fold classification for domains in PDB. 

As signs of progress in this direction, several 
functional classifications have been developed to 
date. One is the ENZYME system developed by 
the Enzyme Commission (EC) to classify enzymes 
by reaction type (Webb, 1992). This system has the 
advantage that it is "universal," applicable to 
proteins in many different organisms, and is in 
wide use. However, it also has several drawbacks. 
First of all, it does not consider catalytic reaction 
mechanisms (Riley, 1998a), often ignoring obvious 
similarities. Second, it presumes a 1:1:1 relationship 
between gene, protein and reaction, although this 
is often not the case (an enzyme can have 
two functions, or two polypeptides from two 
different genes can oligomerize to perform a single 
function). Perhaps the most significant drawback 
of the EC classification is that it applies to only 
enzymes. 

A number of more comprehensive schemes 
have been developed, which classify non- 
enzymes as well as enzymes. Most of these 
focus on individual organisms. Several such 
schemes exist, for instance, GenProtEC/EcoCyc 
for E. coli (Karp et al, 1998b; Riley & Labedan, 
1996; Riley, 1998b), MIPS for yeast (Mewes et al, 
1998), Ashburner's functional classification for 
Drosophila, which is connected to FLYBASE 
(Ashburner & Drysdale, 1994), and EGAD for 
human ESTs (Adams et al, 1995). These classifi- 
cations possess some advantages. They have 
additional levels of hierarchy that help present a 
more comprehensive picture of genotype-pheno- 
type relationships. On the other hand, these 
classifications still leave much room for improve- 
ment. For example, there is no standardized 
vocabulary to allow for keyword searches 
among multiple databases and across organisms, 
and there are inconsistencies in category numbering 
style. 

Finally, there has been some promising work 
going beyond the ENZYME and organism-focused 
classifications. There has been progress on comple- 
tely automated functional classification (des Jardins 
et al, 1997; Tamames et al, 1997), which has the 
potential for putting function assignments on a 
more objective basis. There are a number of data- 
bases synthesizing the various enzyme functions 
into coherent pathways and systems (e.g. KEGG 
and WIT, Ogata et al, 1999; Selkov et al, 1998). 
There also have been some very recent attempts to 
develop cross-species classifications of non-enzyme 
functions in the framework of the Gene Ontology 
Project (GO, geneontology.org). GO is a joint pro- 
ject between FlyBase, the Saccharomyces Genome 
Database and Mouse Genome Informatics, 
attempting to merge the fly, yeast and mouse 
functional classification schemes. However, a truly 
universal system for classifying all protein func- 
tions in all organisms within the same framework 
remains quite a challenge because of the 
sheer diversity of organisms and distinct protein 
functions. 



Our simple functional classification of SCOP 
domains: FLY + ENZYME 

Given the discussed limitations, we constructed 
a simple functional classification for the SCOP 
domains included in our comparison; our classifi- 
cation is based on a merger of two of the existing 
functional annotations and a cross-referencing of 
subsets of this combination with some of the 
organism-specific schemes. First, we used pairwise 
comparison to cross-reference the PDB domains 
against the Swissprot database (Bairoch & 
Apweiler, 1998), as described by Hegyi & Gerstein 
(1999). We chose to assign protein functions 
according to Swissprot because it provides more 
comprehensive functional annotations than SCOP. 

We were initially able to divide all entries into 
enzymes and non-enzymes, a division that rep- 
resents the highest level of functional difference in 
our classification scheme (Figure 6). For the 
enzyme category, we transferred EC (Webb, 1992) 
numbers to those SCOP domains with a one-to-one 
match to a Swissprot enzyme. Only one-to-one 
matching entries could be considered because 
Swissprot assigns ENZYME numbers to entire pro- 
teins, whereas SCOP is a domain-based classifi- 
cation; therefore we could be confident about the 
classification of only those domains which map to 
an entire Swissprot entry. 

In the absence of an EC-type classification for 
non-enzymes, we assigned functions to non-enzy- 
matic SCOP domains according to Ashburner's 
original classification of Drosophila protein func- 
tions. This classification is derived from a con- 
trolled vocabulary of fly terms. It is available on 
the web and loosely connected with the FLYBASE 
database (Ashburner & Drysdale, 1994). For clarity, 
we precisely describe the specific files and version 
(1.55, 1997) of the classification that we used in the 
caption to Figure 6, and we will hereafter refer to 
these data files as constituting the original FLY 
classification. 

The FLY classification is a dynamic object, chan- 
ging as more is learned about the fly and other 
organisms. This is particularly true of late with the 
imminent completion of the Drosophila genome. In 
fact, since the completion of our analysis, the FLY 
classification has been superseded by the new GO 
classification (see above). 

The hierarchical structure of the FLY classifi- 
cation makes it well suited for classifying non- 
enzymatic SCOP entries in a manner comparable 
to the ENZYME assignments for the enzymes. 
Another advantage of this classification is that it is 
more compatible with the makeup of the PDB than 
the E. coli and yeast classifications, as Drosophila is 
a multi-cellular organism, and many of the known 
structures come from animals. We were able to use 
the original FLY classification as a framework to 



which we added functional categories and individ- 
ual proteins. For instance, we added "Hemo- 
globin" to the "Physiological Processes - 
Respiration" category. Another example is the 
"Physiological processes - Immunity" category 
(Figure 6(b)), to which we added immune system 
proteins. Many of the additions would not be 
necessary in the context of the new cross-species 
GO system. We also modified slightly the number- 
ing scheme in the original FLY classification in 
order to assign a unique hierarchical number to 
each protein domain (Figure 6(b)). We will refer to 
our augmented FLY classification as the FLY+ 
scheme, and our merged scheme as the FLY + 
ENZYME classification. 

As discussed earlier, the universal functional 
classification of proteins is very challenging and 
may not be possible with the current level of 
knowledge about genes, proteins and genomes. 
Consequently, the FLY + ENZYME classification 
of SCOP proteins is somewhat incomplete and 
inconsistent and retains many of the limitations 
of its components (Hegyi & Gerstein, 1999; 
Riley, 1998a). It is not yet broad enough to 
include many plant, virus and bacterial proteins. 
Nevertheless, it was sufficient for our analysis, 
as we were able to classify a very large number 
of the total 30,000 pairs. 



Determining functional similarity 

Using our compound functional classification, 
we were able to assign a level of functional simi- 
larity to each domain pair. According to our 
scheme, a pair can have no functional similarity 
(an enzyme paired with a non-enzyme) or it can 
have one of three levels of similarity: 

(i) General similarity. Both domains are 
enzymes or both are non-enzymes. 

(ii) Same functional class. Both domains share 
the first component of their ENZYME or FLY + 
numbers, e.g. 1.1.1.1 alcohol dehydrogenase and 
1.3.1.1 cortisone beta-reductase (for enzymes), or 
3.3.21.2 calcicyclin and 3.6.3.2.1 calmodulin (for 
non-enzymes). 

(iii) Same precise function. Both domains share 
three components of their ENZYME or FLY + 
number, e.g. 1.1.1.1 alcohol dehydrogenase and 
1.1.1.3 homoserine dehydrogenase (for enzymes) 
or 1.2.9.1.1.1 Arc repressor and 1.2.9.1.1.1 C-jun 
(for non-enzymes; both are transcription factors). 
A pair that shares precise function must also, by 
definition, share functional class and general 
similarity. 
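
The three levels above can be expressed as a short comparison routine. The following Python sketch is illustrative only: the function name, the (is_enzyme, number) tuple representation and the precise_depth parameter are our assumptions. The default depth of three components follows the enzyme rule stated in (iii); for non-enzymes the required depth depends on the category, so it is left as a parameter.

```python
def functional_similarity(a, b, precise_depth=3):
    """Return 'none', 'general', 'class' or 'precise' for two domains.

    Each domain is a tuple (is_enzyme, number), where number is a dotted
    ENZYME or FLY+ identifier such as "1.1.1.1".
    """
    (a_enzyme, a_number), (b_enzyme, b_number) = a, b
    # An enzyme paired with a non-enzyme has no functional similarity.
    if a_enzyme != b_enzyme:
        return "none"
    a_parts, b_parts = a_number.split("."), b_number.split(".")
    # Same branch (both enzymes or both non-enzymes) but different class.
    if a_parts[0] != b_parts[0]:
        return "general"
    # Precise function: the leading precise_depth components all agree.
    deep_enough = min(len(a_parts), len(b_parts)) >= precise_depth
    if deep_enough and a_parts[:precise_depth] == b_parts[:precise_depth]:
        return "precise"
    return "class"


if __name__ == "__main__":
    adh = (True, "1.1.1.1")           # alcohol dehydrogenase
    print(functional_similarity(adh, (True, "1.3.1.1")))   # same class
    print(functional_similarity(adh, (True, "1.1.1.3")))   # same precise function
```

Because the checks are nested, a pair reported as 'precise' necessarily also satisfies the 'class' and 'general' conditions, as required by the definition.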

Based on those assignments we calculated the 
percentage of total pairs at a given level of 
sequence or structural similarity possessing each 
level of functional similarity. The results appear in 
Figure 7. 
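
The percentage calculation described here amounts to a simple binning procedure. The Python sketch below shows one way it could be done; the bin width, the data layout and all names are hypothetical choices of ours and are not taken from the original analysis.

```python
import math
from collections import defaultdict

# Ordering of the levels: a pair at a higher level also counts at lower ones.
RANK = {"none": 0, "general": 1, "class": 2, "precise": 3}


def conservation_by_bin(pairs, bin_width=10.0):
    """Percentage of pairs per similarity bin that reach each functional level.

    pairs is an iterable of (similarity_score, level), e.g. (35.0, "class").
    Returns {bin_start: {"general": %, "class": %, "precise": %}}.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: {"general": 0, "class": 0, "precise": 0})
    for score, level in pairs:
        start = bin_width * math.floor(score / bin_width)
        totals[start] += 1
        for name in hits[start]:
            if RANK[level] >= RANK[name]:
                hits[start][name] += 1
    return {
        start: {name: 100.0 * n / totals[start] for name, n in hits[start].items()}
        for start in sorted(totals)
    }


if __name__ == "__main__":
    # Made-up pairs, for illustration only.
    demo = [(35, "precise"), (32, "class"), (38, "general"), (31, "none"),
            (7, "none"), (8, "general")]
    for start, row in conservation_by_bin(demo).items():
        print(start, row)
```

Plotting each level's percentage against the bin position would give curves of the kind shown in Figure 7.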
Sequence and function 

The relation between sequence similarity and 
functional similarity behaves as one might expect, 
with sigmoidal curves that drop off sharply at par- 
ticular conservation thresholds, and with the three 
levels of functional similarity (precise function, 
functional class and general similarity) having pro- 
gressively lower thresholds. Figure 7(a) shows that 
precise function is not conserved below 30-40% 
sequence identity, whereas functional class is con- 
served for sequence identities as low as 20-25%. 
Below 20%, general similarity is no longer con- 
served; among pairs of approximately 7% 
sequence identity, about 40% are enzymes paired 
with non-enzymes. It is important to note that in 
all the pairs considered here, the domains share 
the same fold. Functional similarity at low percent 
identities (e.g. 7%) would be much less for all 
possible pairs of domains rather than just for those 
with the same fold. It is also important to remem- 
ber that our thresholds for functional conservation 
are statistical averages over many sequences; one 
will, of course, be able to find individual cases that 
diverge more or less rapidly. 
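
For orientation, the thresholds can be turned into a crude lookup. The Python sketch below is an illustration, not a predictor: the cut-off values (40%, 25% and 20%) are single numbers that we picked from the ranges quoted above, the function name is ours, and the result only reflects the average behaviour of same-fold pairs.

```python
def expected_conservation(percent_identity,
                          precise_cut=40.0, class_cut=25.0, general_cut=20.0):
    """Highest level of functional similarity expected, on average, to be
    conserved at a given percent identity (same-fold pairs only).

    The default cut-offs are assumptions chosen from the quoted ranges.
    """
    if percent_identity >= precise_cut:
        return "precise"
    if percent_identity >= class_cut:
        return "class"
    if percent_identity >= general_cut:
        return "general"
    return "none"


if __name__ == "__main__":
    for pid in (55, 30, 22, 7):
        print(pid, expected_conservation(pid))
```

Individual protein families may diverge faster or slower than these averages suggest, so the cut-offs are exposed as parameters rather than fixed.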

There are differences between the functional con- 
servation thresholds of enzymes and non-enzymes, 
with enzymes appearing to more highly conserve 
precise function than non-enzymes, but non- 
enzymes conserving functional class more highly 
than enzymes. This may reflect that in our classifi- 
cation, the non-enzyme functional classes are 
broader and hence easier to conserve than those of 
the enzymes, while the non-enzymatic precise 
functions are more specific. 

When P seq is used as the measure of sequence 
similarity (Figure 7(b)) the results look somewhat 
different: it appears that functional class is 
conserved for the entire range of sequence similarities. 
In this case, percent identity is actually more 
discriminating than P seq because functional class 
diverges only at sequence similarities that are low 
enough that they have little or no statistical 
significance, i.e. for P seq the divergence is compressed 
near the vertical axis of the graph. 

Structure and function 

The relation between similarity in structure and 
function is somewhat less straightforward than 
that between similarity in sequence and function. 
Figure 7(c) shows the relationship between RMS 
and functional similarity. Broadly, it appears simi- 
lar to that for percent identity and functional simi- 
larity; however, the thresholds for conservation of 
the various types of functional similarity are less 
sharp. 

RMS is more revealing with respect to functional 
similarity than the non-traditional structural scores, 
S str and P str. (Data for S str and P str are not shown 
but are available from the website.) The reason is 
that, while very structurally similar pairs all have 
RMS scores clustered between 0 and 0.5 Å, S str has 



a large range of scores for similar pairs due to the 
length dependency, and P str does not have any 
limit for maximum similarity. The wide range of 
possible S str and P str scores for similar structures 
tends to blur the broad sigmoid curves so much so 
that they are no longer apparent. 

Alternative functional classifications: MIPS 
and GenProtEC 

To get some perspective on the degree to which 
our results reflected the particularities of our combined 
FLY + ENZYME classification, we decided 
to try the same comparisons based on the well- 
known functional classifications for yeast and 
E. coli, MIPS and GenProtEC (Mewes et al, 1998; 
Riley & Labedan, 1996; Riley, 1998b). These classi- 
fications have the advantage that they integrate 
enzyme and non-enzyme functions from the start 
and are widely used. However, as they are only 
applicable to individual organisms, we could only 
use them to classify a considerably smaller subset 
of the known structures than the compound FLY + 
ENZYME system. 

The specific way we used the MIPS and Gen- 
ProtEC classifications to assign function to struc- 
tures and to calculate functional similarities is 
described in the legend to Figure 7. Our results 
in terms of functional conservation (precise and 
class) at various levels of percent identity are 
shown in Figure 7(d). We observe the same general 
relationships as we did for our FLY + ENZYME 
scheme. That is, the functional 
conservation curves have a sigmoidal shape and 
have cut-offs for precise functional similarity 
after 40% and for functional class similarity at 
lower values. However, because the MIPS and 
GenProtEC classifications are restricted to indi- 
vidual organisms, each curve represents con- 
siderably fewer data points than do the curves 
based on the FLY + ENZYME scheme; this 
required us to "bin" the MIPS and GenProtEC 
curves in a somewhat coarser fashion. 



Discussion and Conclusion 

Here, we assessed the transfer of functional and 
structural annotation by analyzing the relation- 
ships between similarity in sequence, structure and 
function. The ~30,000 protein domain pairs of 
varying levels of similarity (at least the same fold) 
that we constructed out of the SCOP classification 
show quantitative sequence-structure relationships 
consistent with previous research. The exponential 
relationship is consistent across the secondary- 
structural classes and holds for newer probabilistic 
scoring methods. 

The sequence-function and structure-function 
relationships have not been studied as precisely 
due to the lack of a robust functional classification 
and measure of functional similarity. To overcome 
Figure 6. Functional classification of enzymes and non-enzymes. (a) Divides the pairs by general function. There 
are three categories of pairs: (i) enzymes paired with non-enzymes (no general functional similarity), labeled ENZ/~ENZ; 
(ii) enzymes paired with enzymes (same general function), labeled ENZ/ENZ; and (iii) non-enzymes paired 
with non-enzymes (same general function). Pairs for which one or both domains could not be identified as enzyme 
or non-enzyme are not included in this chart. Enzymes are classified according to the EC system (Webb, 1992). The 
first component of the number represents the nature of reaction and is called class. There are six classes: oxidoreduc- 
tases, transferases, hydrolases, lyases, isomerases and ligases. The next level is subclass. It refers to the chemical 
groups on which the enzyme acts. For example, the first class, oxidoreductases, has 19 subclasses that are arranged 
according to the donor group that undergoes oxidation (CH-OH, aldehyde or oxo group, CH-CH group, etc). For 
another group of enzymes (hydrolases) subclass is determined by the nature of the bond: ester bond, peptide bond, 
etc. The next level is sub-subclass. For oxidoreductases this indicates the acceptor group: NAD(+) and NADP(+), or 
cytochrome; for hydrolases the sub-subclass represents the nature of substrate (carboxylic ester hydrolases, thiolester 
hydrolases, etc.). The fourth level represents a unique number for each individual enzyme, for example, 1.1.1.1: alcohol 
dehydrogenase. (b) Shows how we adapted the functional classification of Drosophila gene products developed 
by M. Ashburner. This classification is loosely connected with FLYBASE (Ashburner & Drysdale, 1994). We used version 
1.55 (4 August 1997) that was available from Ashburner's website: 

http://www.ebi.ac.uk/~ashburn 

The specific files that we used were taken from the ftp directory: 

ftp.ebi.ac.uk/databases/edgp/misc/ashburner 
this we constructed our own classification by merging 
and extending the ENZYME and FLY 
schemes and assigning levels of functional simi- 
larity. Our measures of functional similarity pro- 
vide curves relating function to sequence and 
structure; when relating functional conservation to 
sequence divergence, we find distinct thresholds at 
~40% for precise function and ~25% for functional 
class. 

One of the interesting results that emerges from 
this is that percent identity is more useful for quan- 
tifying functional divergence than the newer prob- 
abilistic scores. In general, modern probabilistic 
scores, such as P seq, are better at discriminating 
amongst highly diverged sequences (near the twi- 
light zone) than percent identity, since they better 
take into account gaps and conservative substi- 
tutions (of similar amino acids). However, for very 
similar pairs of sequences, percent identity is a 
simpler and more direct measure of divergence 
(essentially a Hamming distance). Since divergence 
in precise function takes place before that in struc- 
ture (well before the twilight zone), it is quite 
reasonable that percent identity is more successful 
at measuring the former than the latter and that 



the converse is true for the probabilistic scores. In 
other words, percent identity is better calibrated 
for discriminating amongst very close, significant 
relationships and P seq for more distant ones. 
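
To make the Hamming-distance analogy concrete, percent identity can be computed from a pairwise alignment by counting identical aligned positions. The Python sketch below is a minimal illustration; the function name and the convention of dividing by the number of gap-free aligned positions are our assumptions, since several normalizations are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both strings must be rows of the same alignment (equal length), with
    gap positions marked by the gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences have a residue.
    matched = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not matched:
        return 0.0
    identical = sum(1 for x, y in matched if x == y)
    return 100.0 * identical / len(matched)


if __name__ == "__main__":
    # Toy alignment rows, for illustration only.
    print(percent_identity("AC-DE", "ACWDQ"))
```

Unlike an alignment score, this measure gives no credit for conservative substitutions and ignores gaps entirely, which is why it loses discriminating power for highly diverged pairs.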



Practical implications 

The sequence-structure and sequence-function 
relationships described here provide practical 
information for genome annotation in terms of 
folds and functions. Table 1 summarizes the rela- 
tive advantages of the different scoring methods 
we used. Using the trends in sequence and struc- 
ture similarity, one can assess the degree to which 
structural annotation can be transferred between 
sequences at a given level of sequence similarity. 
The sequence and function similarity thresholds 
potentially establish minimum requirements of 
sequence similarity for reliable function prediction. 
Note that because the protein domain pairs con- 
sidered here all share the same fold, the numbers 
for all possible pairs will differ in the region of 
very little sequence identity, in which the sequence 
similarity is not enough to indicate the same fold. 



We refer to these as constituting the original FLY classification. Recently, the FLY classification has been superseded 
by the GO (Gene Ontology) Project classification, which merges fly, mouse and yeast annotation. Files related to the 
GO classification are available from www.geneontology.org. In the original FLY classification all members of the highest 
level are labeled 0, representatives of the next level are labeled 1, and all lower levels are labeled 2 through to 9. 
We changed the numbering scheme so that it will reflect the hierarchical nature of the classification. This 
Figure illustrates sections of the original and modified classification. The top level in the FLY classification scheme is 
called "Function primitive" (level 0) and includes five classes: "Metabolism," "Intracellular protein traffic," "Cell 
structure," "Developmental process," "Physiological process," and "Behavior." The next level after "Function primitive" 
is "Process" or "Molecule" (level 1 in Ashburner's classification). For "Function primitive - Metabolism" the 
processes are "Carbohydrate metabolism," "Nucleotides and nucleic acids metabolism," etc. For "Function primitive 
- Cell Structure" the "Process" can be "Nucleus," "Mitochondrion," "Membrane," etc. The next level is "Pathway" 
or "Macromolecule" (level 2 in the original classification). "Pathway" can include "Metabolic pathway," "Signaling 
pathway," or "Developmental pathway." The "Macromolecule" category includes "Protein" and "Nucleic Acid". We 
added categories to the original classification in order to classify some mammalian proteins that are widely represented 
in SCOP but are absent from the original FLY scheme. These categories include immune system proteins 
(labeled "new" in (b)) and respiratory proteins such as hemoglobin and myoglobin that we added to "Function primitive 
- Physiological process - Respiration". We call our adaptation of the original FLY scheme FLY+. Further information 
on this adaptation is available at: 

http://bioinfo.mbb.yale.edu/align/func 

(c) The overall hierarchy of our final scheme and identification of the different levels of similarity. If two proteins are 
both enzymes or both non-enzymes, then they possess general functional similarity. If they share the first component 
of their classification numbers, then they are in the same functional class. If they share the first three components of 
their enzyme numbers (or the equivalent for non-enzyme numbers, depending on category) then they have the same 
precise function. A significant difference between the two main branches of the hierarchy is that the levels of the 
ENZYME classification do not correspond exactly to those in the FLY+ system because the fly classification is more 
extensive than the enzyme classification. For instance, the FLY classification takes into account aspects of cellular 
(cytoskeleton, metabolic pathways, etc.) and phenotypic function (morphology, physiology, behavior) that are absent 
from the ENZYME scheme. This makes our classification of SCOP proteins somewhat unbalanced, as non-enzymes 
have much broader and more loosely defined functional classes. As a consequence, while each enzyme is assigned a 
four-component number, the length of a non-enzyme number varies, depending on the functional category to which 
it belongs. For example, myosin is assigned a number that happens to have the same length as EC numbers: 3.12.1.1. 
However, transcription factors are numbered 1.12.9.1.1.1. We took into account this varying hierarchy depth in decid- 
ing how many components are necessary to identify precise function in each category. Note that what we mean by 
domains having the same precise function is not the same as the domains coming from the same essential protein. 
Figure 7. Linking sequence, structure and function. We express functional similarity as the fractional percentage of 
pairs at a given level of sequence/structural similarity for which the paired domains share a precise function, functional 
class, or general similarity (according to our classification, see Figure 6). The following legend applies to (a) 
through (c): ( — O — ), general similarity; ( — x — ), non-enzymes with same functional class; ( — A — ), enzymes 
with same functional class; ( — x — ), non-enzymes with same precise function; and (---A---), enzymes with the 
same precise function. (a) Relates functional similarity to sequence similarity in terms of percent identity. The functional 
similarity appears as a sharp sigmoid, with distinct thresholds of divergence for precise function, functional 
class, and general similarity. Enzymes are paired with non-enzymes only at very low percent identity, in and below 
the twilight zone (labeled TZ). At slightly higher sequence identity, pairs diverge with respect to functional class, and 
beyond 40% identity with respect to precise function. Note that 50-100% identity is not shown because almost all 
domains that are that similar share function with their counterparts. (b) Shows the same data using P seq as the 
measure of sequence similarity. Only the divergence in precise function is visible because there is so little significance 
for the low sequence similarity at which functional class and general similarity diverge; all data points in that 
region appear near P seq = 1 or log[P seq] = 0 (the y-axis). (c) Illustrates that the structure-function relation is not as 
clearly defined as that for sequence and function. Functional similarity expressed in terms of RMS separation appears 
as a broad sigmoid curve; there are thresholds of divergence for precise function, but the divergences in functional 
class and general similarity are more gradual. The thresholds are apparent only because RMS clusters the most structurally 
similar pairs between scores of 0 and 0.5 Å. For this reason, RMS is better at discerning functional similarity 
than S str and P str, which do not cluster the most similar pairs around a set limit. (d) Shows the same relationships 
(functional conservation versus percent identity) as in (a), except that for this graph functional similarity is determined 
in terms of the MIPS (Mewes et al, 1998) and GenProtEC (Riley, 1998b) classifications rather than the FLY + ENZYME 
scheme. The legend appears as the inset on the graph. We assigned MIPS and GenProtEC classifications 
to SCOP domains based on sequence comparisons to classified yeast and E. coli open reading frames (ORFs), respectively. 
The SCOP domain most closely matching each ORF classified in MIPS or GenProtEC was assigned the corresponding 
MIPS or GenProtEC function number. Only matches of 80% sequence identity or greater were considered. 
We used this SCOP domain as a functional representative; when determining functional similarity, we assigned to 
SCOP domains with no MIPS or GenProtEC functional designation the function of the closest representative with at 
least 85% sequence identity, if one existed. GenProtEC functional identifiers are three-component numbers. We con- 
sider a pair of domains sharing the first component of their functional designation to be in the same functional class. 
Domains that share all three components are said to have the same precise function. For MIPS the functional desig- 
nation is not as straightforward, as one ORF can be assigned multiple functions. Therefore we consider domains 
which have at least one function in common to share functional class. Domains with all functions in common, the 
same combination of identifiers, share precise function. Because MIPS and GenProtEC each classify the proteins of a 
single organism, yeast and E. coli, respectively, these classifications can determine the functional similarities of only a 
small fraction of all our SCOP domain pairs. The data based on these classifications, appearing in (d), are therefore 
very sparse compared to the data in (a)-(c). Despite the coarseness of the data, functional similarity based on the 
MIPS and GenProtEC classifications follows the same general relation to sequence similarity as does functional
similarity based on the more comprehensive FLY + ENZYME scheme. The vertical line indicates an approximate threshold of
functional divergence at 40% identity.
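
The comparison rules described in (d) can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code used for the analysis: the dotted-string encoding of GenProtEC identifiers, the representation of MIPS assignments as sets of strings, and all function and class names are assumptions made for the example.

```python
from __future__ import annotations

from enum import Enum


class Similarity(Enum):
    NONE = 0
    FUNCTIONAL_CLASS = 1
    PRECISE_FUNCTION = 2


def compare_genprotec(a: str, b: str) -> Similarity:
    """Compare two three-component identifiers written as 'x.y.z' (assumed encoding)."""
    parts_a, parts_b = a.split("."), b.split(".")
    if parts_a == parts_b:
        # All three components shared: same precise function.
        return Similarity.PRECISE_FUNCTION
    if parts_a[0] == parts_b[0]:
        # Only the first component shared: same functional class.
        return Similarity.FUNCTIONAL_CLASS
    return Similarity.NONE


def compare_mips(a: set[str], b: set[str]) -> Similarity:
    """Compare two sets of MIPS function identifiers (one ORF may carry several)."""
    if a and a == b:
        # Same combination of identifiers: same precise function.
        return Similarity.PRECISE_FUNCTION
    if a & b:
        # At least one function in common: same functional class.
        return Similarity.FUNCTIONAL_CLASS
    return Similarity.NONE
```

For instance, with made-up identifiers, compare_genprotec("1.5.3", "1.2.7") yields FUNCTIONAL_CLASS, and compare_mips({"01", "30"}, {"30"}) does as well, since the two sets overlap without being identical.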
Table 1. Summary of scoring methods

| Score type | Sequence similarity | Structural similarity | Features | Limitations |
|---|---|---|---|---|
| Traditional scores | Percent sequence identity | RMS Cα separation | Well understood, in use; percent identity better for looking at functional similarity | RMS depends most highly on worst matches, requiring arbitrary trimming; percent identity is insensitive to gaps and conservative substitutions |
| Alignment similarity scores | S_seq | S_str | Analogous similarity scores; S_str depends most highly on best matches | Dependence on alignment length |
| Modern probabilistic scores | P_seq | P_str | Statistical significance; unified framework for different comparisons | Not as familiar as RMS and percent identity |

The Table lists the schemes presented here for characterizing the sequence-structure relationship, along with their
relative advantages and disadvantages.
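
The limitation noted for RMS (dependence on the worst-matching positions, hence the need for trimming) can be illustrated with a small Python sketch. It assumes the per-residue Cα distances have already been obtained from a superposition, and the trimming rule used here (keep a fixed fraction of the best-matching positions) is a hypothetical choice for illustration, not necessarily the procedure used in the analysis.

```python
import math


def rms(distances):
    """Root-mean-square of per-residue C-alpha distances (in angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def trimmed_rms(distances, keep_fraction=0.5):
    """RMS over only the best-matching fraction of positions.

    keep_fraction is an illustrative parameter (an assumption of this sketch);
    the worst-matching positions are discarded before the RMS is computed.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(round(len(distances) * keep_fraction)))
    return rms(sorted(distances)[:n_keep])


# Four well-matched positions and one badly matched loop residue.
example = [0.5, 0.5, 0.5, 0.5, 10.0]
untrimmed = rms(example)                       # dominated by the single outlier
trimmed = trimmed_rms(example, keep_fraction=0.8)  # outlier discarded
```

In this toy example a single poorly matched position raises the untrimmed RMS well above 4 Å, while the trimmed value stays at 0.5 Å, which is why the choice of trimming cut-off matters.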



Practically, then, when one searches an uncharacterized open reading frame against known structures,
if the open reading frame matches a structure with a good e-value or percent identity, then the curves
presented here can be used to check how the functional and detailed structure annotation will transfer.
For example, if an unknown open reading frame matches a PDB structure with an e-value of 0.001 and a
percent identity of 30%, then one can be assured that it has the same fold (Brenner et al., 1998) and,
according to our analysis, it has a two-thirds chance of having the same exact function. Furthermore, it
has a ~99% chance of having the same functional class, and its structure probably diverges from the
known structure by a trimmed RMS of less than 0.7 Å.
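
As a rough illustration of how such a check might be automated, the following Python sketch reduces the sigmoid curves to a coarse step function. Only the approximate 40% threshold for precise function is taken from the text; the upper bound of the twilight zone is not given numerically here, so it is a required argument, and the value 25.0 in the usage line is purely an assumption. The function name and return labels are hypothetical.

```python
def expected_conservation(percent_identity: float, *,
                          twilight_zone_upper: float,
                          precise_function_threshold: float = 40.0) -> str:
    """Return the finest level of annotation likely to transfer from a match.

    A step-function simplification: the real curves are sigmoids, so
    probabilities near the thresholds are intermediate, not all-or-nothing.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    if percent_identity >= precise_function_threshold:
        return "precise function"
    if percent_identity > twilight_zone_upper:
        return "functional class"
    return "general similarity"


# The 30% identity match from the example above, with an assumed twilight-zone bound.
level = expected_conservation(30.0, twilight_zone_upper=25.0)
```

Under these assumptions the 30% match falls in the "functional class" regime, consistent with the statement that functional class is very likely shared while precise function is shared only about two-thirds of the time.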



Future directions 

There are a number of directions in which we might extend this analysis. With respect to the
sequence-structure relation, we can reduce the overrepresentation of the immunoglobulins and
improve the calculation of P_str (by redoing the fit to the extreme value distribution reported by
Levitt & Gerstein (1998)) to eliminate residual length-dependency.

In the functional realm, we can investigate if and how the sequence-function and structure-function
relationships vary for different categories of proteins. For example, although we found consistency
of the sequence-structure relationship among secondary structural classes, Hegyi & Gerstein (1999)
found that the distribution of enzymes and non-enzymes varies with secondary structural class.
A related issue is that of conformational changes. It is conceivable that among domains with very
similar sequences, but structures that differ by a conformational change, function is less conserved
than it is among similar sequences with more similar structures.

Perhaps the most important direction in which to further this work is the augmentation of the
functional classification. With the growing number of fully sequenced genomes there is a
need for the development of a comprehensive system for functionally classifying proteins, a
complete classification for the entire universe of protein functions. It will be a difficult process,
as many existing organism-specific classifications will have to be merged, but the end result will
have the advantage of not being biased towards any one organism. Such a universal classification
will allow much more reliable transfer of functional annotation.